To avoid redundant parses of the input file, some care is needed: in this algorithm, the input file is parsed only once.
* Load all document ids in advance (in our case we use the total line number `n`, obtained from the line-number counter during preprocessing);
* We use [output_preprocess_sample](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess_sample) as input, and for each document from the input we emit `(id,i)`, where `id` is the document's id and `i` ranges over `1:n` with `i != id`. Note that some documents may be empty, and I will treat them during the Reduce phase; a fuller sketch follows the excerpt below.
```
public void map(Text key, Text value, Context context)
...
...
}
```
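The excerpt above is abridged; here is a minimal sketch of what such a mapper could look like. It assumes the input is read as `(id, words)` text pairs and that `n` is passed through the job `Configuration` under a hypothetical key `total.lines`; the class and key names are mine, not necessarily those of the committed code.
```
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: emits (id, i) for every other document id i, as described above.
public class NaivePairMapper extends Mapper<Text, Text, Text, Text> {
    private long n;  // total number of documents

    @Override
    protected void setup(Context context) {
        // total line number counted during preprocessing, passed via the
        // hypothetical configuration key "total.lines"
        n = context.getConfiguration().getLong("total.lines", 0);
    }

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        long id = Long.parseLong(key.toString());
        for (long i = 1; i <= n; i++) {
            if (i != id) {
                // document i becomes a comparison candidate for document id
                context.write(key, new Text(String.valueOf(i)));
            }
        }
    }
}
```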
**STEP 3: Reduce phase**
In the reduce phase,
* First, process only keys that correspond to non-empty documents.
* Second, compute the similarity of the two documents for each key, and count the number of comparisons with a counter (sketched just below).
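A minimal sketch of this reduce phase, assuming Jaccard similarity and assuming the pre-processed file is loaded into a `documents` map in `setup()` (as the Index Approach below does); the names and the counter group are illustrative, not necessarily those of the committed code:
```
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: `documents` (id -> word set) is assumed to be filled from the
// pre-processed file in setup(), omitted here for brevity.
public class NaiveReducer extends Reducer<Text, Text, Text, Text> {
    private final Map<String, Set<String>> documents = new HashMap<>();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text v : values) {
            // process each unordered pair exactly once
            if (key.toString().compareTo(v.toString()) >= 0) {
                continue;
            }
            context.getCounter("NaiveApproach", "Comparisons").increment(1);
            Set<String> d1 = documents.get(key.toString());
            Set<String> d2 = documents.get(v.toString());
            if (d1 == null || d1.isEmpty() || d2 == null || d2.isEmpty()) {
                continue;  // empty documents are skipped here, in the reduce phase
            }
            double s = jaccard(d1, d2);
            if (s >= 0.8) {
                context.write(new Text(key + "," + v), new Text(String.valueOf(s)));
            }
        }
    }

    // Jaccard similarity |A ∩ B| / |A ∪ B|
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }
}
```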
[The execution time](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/hadoop_naive.PNG) is `4 min, 15 seconds`.
[The number of comparisons](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counter_Naive.PNG) is 11476. Knowing that we have [n=152 documents](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/LineNum_sample.PNG) in our sampled file, we check that 11476 = 152*(152-1)/2, so the algorithm worked as expected.
See the complete code [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/NaiveApproach.java). I didn't commit the output since it's empty for the sample.
### Index Approach
In this method, I need to create an inverted index, restricted to the first |d| - ⌈t|d|⌉ + 1 words of each document d (with words kept in a fixed global order): two documents with similarity at least t must share at least one word within these prefixes, so only documents that collide on a prefix word become candidate pairs. For example, with t = 0.8 and |d| = 10 words, the prefix contains 10 - 8 + 1 = 3 words. My reducer then computes the similarity of the candidate document pairs.
**STEP 1: Map phase**
* compute the prefix filtering length `int filter`
* emit each prefix word as `key` and the document id as `value`
```
public void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
    // skip empty documents
    if (value.toString().isEmpty()) {
        return;
    }
    String[] document = value.toString().split(",");
    // prefix length |d| - ceil(t*|d|) + 1, with t = 0.8
    int filter = document.length - (int) Math.ceil(document.length * 0.8) + 1;
    // emit (prefix word, document id) for the first `filter` words
    for (int counter = 0; counter < filter; counter++) {
        word.set(document[counter]);
        context.write(word, key);
    }
}
```
**STEP 2: Reduce phase**
* we load the adapted pre-processed file into a HashMap object
* for each key, we get all possible combinations of document pairs from the document list
* we compute the similarity of the two documents in each pair, if any, using the same `similarity function` as before, and count the number of comparisons (see the sketch after the excerpt below)
```
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    ...
                context.write(new Text(v1.toString() + ',' + v2.toString()),
                        new Text(String.valueOf(s)));
            }
        }
    }
}
```
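For concreteness, a minimal sketch of such a reducer; `documents` and `jaccard` are the same illustrative helpers as in the naive sketch above, not necessarily the committed code:
```
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: key is a prefix word, values are the ids of all documents whose
// prefix contains it; `documents` is assumed to be filled in setup().
public class IndexReducer extends Reducer<Text, Text, Text, Text> {
    private final Map<String, Set<String>> documents = new HashMap<>();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> ids = new ArrayList<>();
        for (Text v : values) {
            ids.add(v.toString());
        }
        // all unordered combinations of the document list for this prefix word
        for (int i = 0; i < ids.size(); i++) {
            for (int j = i + 1; j < ids.size(); j++) {
                context.getCounter("IndexApproach", "Comparisons").increment(1);
                Set<String> d1 = documents.get(ids.get(i));
                Set<String> d2 = documents.get(ids.get(j));
                if (d1 == null || d2 == null) {
                    continue;
                }
                double s = NaiveReducer.jaccard(d1, d2);  // same similarity function
                if (s >= 0.8) {
                    context.write(new Text(ids.get(i) + "," + ids.get(j)),
                            new Text(String.valueOf(s)));
                }
            }
        }
    }
}
```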
[The execution time](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Hadoop_IndexApproach.PNG) is `42 seconds`, much less than the Naive Approach's `4 min, 15 seconds`.
[The number of comparisons](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counter_IndexApproach.PNG) is 17, far fewer than the Naive Approach's 11476.
See the complete code [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/IndexApproach.java). I didn't commit the output since it's empty for the sample.