**STEP 2: Map phase**
To avoid redundant parses of the input file, a little care is needed: in this algorithm the input file is parsed only once.
* Load in advance the total number of documents `n` (in our case, the total line count obtained from the line-number counter during preprocessing);
* We use [output_preprocess_sample](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess_sample) as input and, for each document from the input, emit `(id, i)`, where `id` is the document's id and `i` ranges over `1:n` with `i != id`. For example, with `n = 3`, document 2 emits the keys `(2,1)` and `(2,3)`, each carrying document 2's text as the value. Note that some documents may be empty; these are handled in the Reduce phase;
```
public void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
    // Skip empty documents; they are dealt with in the Reduce phase.
    if (value.toString().isEmpty()) {
        return;
    }
    String keyOut = key.toString();
    if (!StringUtils.isNumeric(keyOut)) {
        System.out.println("WARN: Bad input id");
        System.out.println(keyOut);
        return;
    }
    this.words = value;
    // Emit (id, i) for every other document id i in 1..n.
    long counter = 1;
    while (counter <= this.fileLength) {
        String counterstr = Long.toString(counter);
        if (keyOut.equals(counterstr)) { // never pair a document with itself
            counter += 1;
            continue;
        }
        this.keyPair.set(key, new Text(counterstr));
        context.write(keyPair, words);
        counter += 1;
    }
}
```
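The mapper above relies on the fields `fileLength`, `keyPair`, and `words`, which are assumed to be initialized once per task in `setup()`. A minimal sketch, assuming the line count is passed through the job configuration under a hypothetical property name:
```
private long fileLength;                          // total number of documents n
private final TextPair keyPair = new TextPair();  // reusable output key
private Text words = new Text();                  // reusable output value

@Override
protected void setup(Context context) {
    // "file.length" is a hypothetical property name; the driver would set it
    // from the line-number counter obtained during preprocessing.
    this.fileLength = context.getConfiguration().getLong("file.length", 0);
}
```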
**STEP 3: Reduce phase**
In the reduce phase,
* First, process only keys whose two documents are both non-empty (i.e., keys that received exactly two values).
* Second, compute the similarity of the two documents for each key, counting the number of comparisons with a Hadoop counter.
* Finally, emit the key pairs that are similar.
The function for calculating the similarity is as follows:
```
public double similarity(String t1, String t2) {
    Set<String> s1 = text2Set(t1);
    Set<String> s2 = text2Set(t2);
    Set<String> union = new HashSet<String>(s1);
    union.addAll(s2);
    Set<String> intersection = new HashSet<String>(s1);
    intersection.retainAll(s2);
    if (union.size() == 0) {
        return 0;
    }
    // Cast to double to avoid integer division, which would always return 0 or 1.
    return (double) intersection.size() / union.size();
}
```
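The helper `text2Set` is not shown here; a minimal sketch, assuming documents are tokenized on whitespace:
```
private Set<String> text2Set(String text) {
    // Split on whitespace and collect the distinct words of the document.
    Set<String> words = new HashSet<String>();
    for (String w : text.split("\\s+")) {
        if (!w.isEmpty()) {
            words.add(w);
        }
    }
    return words;
}
```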
Reducer:
```
public void reduce(TextPair key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    int counter = 0;
    String[] strings = new String[2];
    for (Text v : values) {
        if (counter < 2) { // by construction each key receives at most two values
            strings[counter] = v.toString();
        }
        counter += 1;
    }
    if (counter != 2) { // one of the two document ids is not in the input file (empty document)
        return;
    }
    double s = similarity(strings[0], strings[1]);
    context.getCounter(CompCounter.NUM).increment(1); // count one comparison
    if (s >= 0.8) {
        context.write(new Text(key.toString()), new Text(String.valueOf(s)));
    }
}
```
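`CompCounter.NUM` is the Hadoop counter used to track the number of comparisons; a minimal sketch, assuming it is declared as a simple enum in the job class:
```
// Assumed counter declaration; Hadoop aggregates enum counters across all tasks.
public enum CompCounter {
    NUM
}
```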
[The execution](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/hadoop_naive.PNG) takes `4 min 15 s`.
[The number of comparisons](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counter_Naive.PNG) is 11476. Knowing that we have [n=152 documents](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/LineNum_sample.PNG) in our sampled file, we find 11476 = 152*(152-1)/2. So the algorithm worked as expected.
In the reduce phase, we process only keys that received exactly two values. This way empty documents are ignored: because empty documents are not in the input file, their keys appear only once. Empty documents are rare, so the computation time is barely affected.
The two values are exactly the two documents to compare for each key: we compute their similarity and emit the key pairs that are similar.