To avoid redundant parses of the input file, some care is needed: in this algorithm, the input file is parsed only once.
* Load in advance all document ids (in our case, the total line number `n` obtained from the line-number counter during preprocessing);
* We use [output_preprocess_sample](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess_sample) as input and, for each document, emit `(id, i)`, where `id` is the document's id and `i` ranges over `1:n` with `i != id`. Note that some documents may be empty; I handle them during the Reduce phase.
```
public void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
    ...
}
```
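Since the body of this mapper is elided above, here is a minimal sketch of what it could look like. It assumes the total line count `n` was stored in the job configuration under an illustrative property name (`total.docs`), and that the `TextPair` key class defined earlier in the report takes two `Text` ids and orders them so that `(a,b)` and `(b,a)` meet in the same reducer; this is a sketch of the idea, not the committed implementation.
```
// Illustrative sketch: total document count n, read once per mapper.
private long numDocs;

@Override
protected void setup(Context context) {
    // Assumes the driver stored the preprocessing counter value under "total.docs".
    numDocs = context.getConfiguration().getLong("total.docs", 0);
}

public void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
    long id = Long.parseLong(key.toString());
    for (long i = 1; i <= numDocs; i++) {
        if (i == id) {
            continue;  // never pair a document with itself
        }
        // Emit the pair in sorted order so both documents of a pair share
        // one key; the document's content travels as the value.
        String first = String.valueOf(Math.min(id, i));
        String second = String.valueOf(Math.max(id, i));
        context.write(new TextPair(new Text(first), new Text(second)), value);
    }
}
```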
**STEP 3: Reduce phase**
In the reduce phase,
* First, process only the keys that do not involve empty lines.
* Second, compare the similarity of the two documents for each key, and count the number of comparisons with a counter (see the sketch after this list).
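The report does not show this reducer, so here is a minimal sketch of the two steps above. It assumes each pair key receives the two documents' contents as values, reuses the `similarity(String, String)` helper defined earlier in the report, and uses an illustrative counter name; the 0.8 threshold matches the one used in the Index Approach below.
```
public void reduce(TextPair key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // Collect the (at most two) non-empty documents sent to this pair key.
    List<String> docs = new ArrayList<>();
    for (Text v : values) {
        if (!v.toString().isEmpty()) {
            docs.add(v.toString());
        }
    }
    // Step 1: skip keys that involve an empty line.
    if (docs.size() < 2) {
        return;
    }
    // Step 2: compare the two documents and count the comparison.
    context.getCounter("Stats", "Comparisons").increment(1);  // illustrative name
    double s = similarity(docs.get(0), docs.get(1));
    if (s >= 0.8) {
        context.write(key, new Text(String.valueOf(s)));
    }
}
```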
[The execution time](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/hadoop_naive.PNG) is `4 minutes 15 seconds`.
The [number of comparisons](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counter_Naive.PNG) is 11476. Knowing that we have [n = 152 documents](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/LineNum_sample.PNG) in our sampled file, we find 11476 = 152*(152-1)/2, so the algorithm worked as expected.
See the complete code [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/NaiveApproach.java). I didn't commit the output since it's empty for the sample.
### Index Approach
In this method, I need to create an inverted index, but only for the first |d| - ⌈t·|d|⌉ + 1 words of each document d. For example, with t = 0.8 and |d| = 10 words, only the first 10 - 8 + 1 = 3 words are indexed. My reducer then computes the similarity of the candidate document pairs.
**STEP 1: Map phase**
* compute the prefix-filtering length `int filter`
* emit each prefix-filtering word as `key` and the document id as `value`
```
public void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
    // Skip empty documents
    if (value.toString().isEmpty()) {
        return;
    }
    String[] document = value.toString().split(",");
    // Prefix filtering: index only the first |d| - ceil(t*|d|) + 1 words, with t = 0.8
    int filter = document.length - (int) Math.ceil(document.length * 0.8) + 1;
    // Emit each prefix word as key, with the document id as value
    int counter = 0;
    while (counter < filter) {
        word.set(document[counter]);
        context.write(word, key);
        counter += 1;
    }
}
```
**STEP 2: Reduce phase**
* we load the adapted pre-processed file into a `HashMap` object
* for each key, we form all possible pairs of documents from the key's document list
* we compute the similarity of the two documents in each pair, using the same `similarity function` as before, and count the number of comparisons (a fuller sketch follows the snippet below)
```
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    ...
    // for each candidate pair (v1, v2) whose similarity s passes the threshold:
    context.write(new Text(v1.toString() + ',' + v2.toString()),
            new Text(String.valueOf(s)));
    ...
}
```
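For context around the truncated snippet above, here is a minimal sketch of the whole reduce step. It assumes a `HashMap<String, String> documents` was filled in `setup()` from the adapted pre-processed file (document id to comma-separated words), and it reuses the same `similarity` helper and 0.8 threshold as before; the names are illustrative, not the exact implementation.
```
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // Gather the id of every document whose prefix contains this word.
    List<String> ids = new ArrayList<>();
    for (Text v : values) {
        ids.add(v.toString());
    }
    // Form every pair of documents that share this prefix word.
    for (int a = 0; a < ids.size(); a++) {
        for (int b = a + 1; b < ids.size(); b++) {
            Text v1 = new Text(ids.get(a));
            Text v2 = new Text(ids.get(b));
            context.getCounter("Stats", "Comparisons").increment(1);  // illustrative name
            double s = similarity(documents.get(v1.toString()),
                    documents.get(v2.toString()));
            if (s >= 0.8) {
                context.write(new Text(v1.toString() + ',' + v2.toString()),
                        new Text(String.valueOf(s)));
            }
        }
    }
}
```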
[The execution time](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Hadoop_IndexApproach.PNG) is `42 seconds`, much less than the Naive Approach.
The [number of comparisons](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counter_IndexApproach.PNG) is 17, far fewer than in the Naive Approach.
See the complete code [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/IndexApproach.java). I didn't commit the output since it's empty for the sample.