To avoid redundant parses of the input file, some care is needed: in this algorithm, the input file is parsed only once.
* Load all document ids in advance (in our case we use the total line number `n`, obtained from the line-number counter during preprocessing);
* We use [output_preprocess_sample](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess_sample) as input, and for each document from the input we emit `(id,i)`, where `id` is the document's id and `i` ranges over `1:n` with `i != id`. Note that some documents may be empty, and I will treat them during the Reduce phase; a fuller sketch follows the excerpt below.
```
public void map(Text key, Text value, Context context)
...
...
}
```
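The excerpt above is abridged; here is a minimal sketch of what such a mapper could look like. It assumes the input is read as `(id, words)` text pairs and that `n` is passed through the job `Configuration` under a hypothetical key `total.lines`; the class and key names are mine, not necessarily those of the committed code.
```
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: emits (id, i) for every other document id i, as described above.
public class NaivePairMapper extends Mapper<Text, Text, Text, Text> {
    private long n;  // total number of documents

    @Override
    protected void setup(Context context) {
        // total line number counted during preprocessing, passed via the
        // hypothetical configuration key "total.lines"
        n = context.getConfiguration().getLong("total.lines", 0);
    }

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        long id = Long.parseLong(key.toString());
        for (long i = 1; i <= n; i++) {
            if (i != id) {
                // document i becomes a comparison candidate for document id
                context.write(key, new Text(String.valueOf(i)));
            }
        }
    }
}
```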
**STEP 3: Reduce phase**
In the reduce phase,
* First, process only keys that correspond to non-empty documents.
* Second, compute the similarity of the two documents for each key, and count the number of comparisons with a counter (sketched just below).
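A minimal sketch of this reduce phase, assuming Jaccard similarity and assuming the pre-processed file is loaded into a `documents` map in `setup()` (as the Index Approach below does); the names and the counter group are illustrative, not necessarily those of the committed code:
```
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: `documents` (id -> word set) is assumed to be filled from the
// pre-processed file in setup(), omitted here for brevity.
public class NaiveReducer extends Reducer<Text, Text, Text, Text> {
    private final Map<String, Set<String>> documents = new HashMap<>();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text v : values) {
            // process each unordered pair exactly once
            if (key.toString().compareTo(v.toString()) >= 0) {
                continue;
            }
            context.getCounter("NaiveApproach", "Comparisons").increment(1);
            Set<String> d1 = documents.get(key.toString());
            Set<String> d2 = documents.get(v.toString());
            if (d1 == null || d1.isEmpty() || d2 == null || d2.isEmpty()) {
                continue;  // empty documents are skipped here, in the reduce phase
            }
            double s = jaccard(d1, d2);
            if (s >= 0.8) {
                context.write(new Text(key + "," + v), new Text(String.valueOf(s)));
            }
        }
    }

    // Jaccard similarity |A ∩ B| / |A ∪ B|
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }
}
```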
[The execution time](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/hadoop_naive.PNG) is `4 min, 15 seconds`.
[The number of comparisons](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counter_Naive.PNG) is 11476. Knowing that we have [n=152 documents](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/LineNum_sample.PNG) in our sampled file, we check that 11476 = 152*(152-1)/2, so the algorithm worked as expected.
See the complete code [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/NaiveApproach.java). I didn't commit the output since it's empty for the sample.
### Index Approach
In this method, I need to create an inverted index, restricted to the first |d| - ⌈t|d|⌉ + 1 words of each document d (with words kept in a fixed global order): two documents with similarity at least t must share at least one word within these prefixes, so only documents that collide on a prefix word become candidate pairs. For example, with t = 0.8 and |d| = 10 words, the prefix contains 10 - 8 + 1 = 3 words. My reducer then computes the similarity of the candidate document pairs.
**STEP 1: Map phase**
* compute the prefix filtering length `int filter`
* emit each prefix word as `key` and the document id as `value`
```
public void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
    // skip empty documents
    if (value.toString().isEmpty()) {
        return;
    }
    String[] document = value.toString().split(",");
    // prefix length |d| - ceil(t*|d|) + 1, with t = 0.8
    int filter = document.length - (int) Math.ceil(document.length * 0.8) + 1;
    // emit (prefix word, document id) for the first `filter` words
    for (int counter = 0; counter < filter; counter++) {
        word.set(document[counter]);
        context.write(word, key);
    }
}
```
**STEP 2: Reduce phase**
* we load the adapted pre-processed file into a HashMap object
* for each key, we get all possible combinations of document pairs from the document list
* we compute the similarity of the two documents in each pair, if any, using the same `similarity function` as before, and count the number of comparisons (see the sketch after the excerpt below)
```
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    ...
                context.write(new Text(v1.toString() + ',' + v2.toString()),
                        new Text(String.valueOf(s)));
            }
        }
    }
}
```
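For concreteness, a minimal sketch of such a reducer; `documents` and `jaccard` are the same illustrative helpers as in the naive sketch above, not necessarily the committed code:
```
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: key is a prefix word, values are the ids of all documents whose
// prefix contains it; `documents` is assumed to be filled in setup().
public class IndexReducer extends Reducer<Text, Text, Text, Text> {
    private final Map<String, Set<String>> documents = new HashMap<>();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> ids = new ArrayList<>();
        for (Text v : values) {
            ids.add(v.toString());
        }
        // all unordered combinations of the document list for this prefix word
        for (int i = 0; i < ids.size(); i++) {
            for (int j = i + 1; j < ids.size(); j++) {
                context.getCounter("IndexApproach", "Comparisons").increment(1);
                Set<String> d1 = documents.get(ids.get(i));
                Set<String> d2 = documents.get(ids.get(j));
                if (d1 == null || d2 == null) {
                    continue;
                }
                double s = NaiveReducer.jaccard(d1, d2);  // same similarity function
                if (s >= 0.8) {
                    context.write(new Text(ids.get(i) + "," + ids.get(j)),
                            new Text(String.valueOf(s)));
                }
            }
        }
    }
}
```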
[The execution time](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Hadoop_IndexApproach.PNG) is `42 seconds`, much less than the Naive Approach's `4 min, 15 seconds`.
[The number of comparisons](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counter_IndexApproach.PNG) is 17, far fewer than the Naive Approach's 11476.
See the complete code [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/IndexApproach.java). I didn't commit the output since it's empty for the sample.