}
```
The stop word file can be found [here](stopwords).
#### 2. Count word frequency of pg100.txt
By using the wordcount algorithm again, we recount the word frequency for `pg100.txt`, to be used later for word sorting. This time capitalization is kept, so that it is taken into account in the similarity comparison. The output file can be found [here](wordfreq).
#### 3. Output lines
In this step, several tasks need to be done:
...
The output file can be found [here](sortedline). The total line number is output to HDFS as instructed; you can also find that file [here](linenumber).
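To make the structure of this job concrete, here is a minimal map-only sketch under stated assumptions: the configuration key `wordfreq.path`, the counter name, the whitespace tokenizer, and the single-mapper line numbering are illustrative choices, not necessarily those of the repository. It loads the step-2 word frequencies in `setup`, then emits each non-empty line as `lineId<TAB>words sorted by ascending frequency` while counting the lines with a Hadoop counter:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: one output record per non-empty line, words sorted by ascending frequency.
public class SortedLineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    public enum LineCounters { NON_EMPTY_LINES }  // written to HDFS by the driver afterwards

    private final Map<String, Long> wordFreq = new HashMap<>();
    private long lineId = 0;  // assumes a single mapper so that ids are consecutive

    @Override
    protected void setup(Context context) throws IOException {
        // "wordfreq.path" is an assumed configuration key pointing to the
        // word-frequency file of step 2 ("<word>\t<count>" per line).
        Path path = new Path(context.getConfiguration().get("wordfreq.path"));
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length == 2) wordFreq.put(parts[0], Long.parseLong(parts[1]));
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize on whitespace and drop duplicate words within the line.
        // (Stop-word filtering, as in step 1, could also be applied here.)
        Set<String> unique = new LinkedHashSet<>();
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) unique.add(token);
        }
        if (unique.isEmpty()) return;  // empty lines are skipped entirely

        // Sort the line's words by ascending global frequency.
        List<String> sorted = new ArrayList<>(unique);
        sorted.sort(Comparator.comparingLong((String w) -> wordFreq.getOrDefault(w, 0L)));

        lineId++;
        context.getCounter(LineCounters.NON_EMPTY_LINES).increment(1);
        context.write(new LongWritable(lineId), new Text(String.join(" ", sorted)));
    }
}
```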
---
### Set similarity joins
...
    return (double) intersection.size() / union.size();
}
```
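For readability, here is a self-contained version of such a similarity helper, as a sketch under the assumption that documents are whitespace-separated strings (the class and method names are illustrative, not necessarily those of the repository):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardSimilarity {

    // Jaccard similarity of the word sets of two documents: |A ∩ B| / |A ∪ B|.
    public static double similarity(String doc1, String doc2) {
        Set<String> words1 = new HashSet<>(Arrays.asList(doc1.split("\\s+")));
        Set<String> words2 = new HashSet<>(Arrays.asList(doc2.split("\\s+")));

        Set<String> intersection = new HashSet<>(words1);
        intersection.retainAll(words2);

        Set<String> union = new HashSet<>(words1);
        union.addAll(words2);

        if (union.isEmpty()) {
            return 0.0;  // two empty documents
        }
        // Cast to double to avoid integer division.
        return (double) intersection.size() / union.size();
    }
}
```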
##### _File sampling_
Our input file has more than `100000` documents. With pairwise comparison in the naive approach, roughly `10^10` key/value pairs would be emitted (each of the `n*(n-1)/2 ≈ 5*10^9` unordered pairs is emitted twice). This far exceeds the capacity of our virtual machine. In the following sections, we therefore only tested the algorithms on sampled documents. The files can be found [here](sortedline_sample) and [here](linenumber_sample).
#### 1. Naive approach
> Perform all pairwise comparisons between documents, using the following technique: each document is handled by a single mapper (remember that lines are used to represent documents in this assignment). The map method should emit, for each document, the document id along with one other document id as a key (one such pair for each other document in the corpus) and the document's content as a value. In the reduce phase, perform the Jaccard computations for all/some selected pairs. Output only similar pairs on HDFS, in TextOutputFormat.
To avoid redundant parsing of the input file, some care is needed: in this algorithm, the input file is parsed only once.
* Collect in advance the set of all document ids (in our case, derived from the total line number obtained with the counter of the previous step; ids of empty documents are dealt with below).
* For each id and document in the input, emit `(id,i), document` for every `i != id` in the document id set (including ids of empty lines). Due to the symmetry of the key pair, most keys receive two instances.
* In the reduce phase, process only keys with `two` instances. This ignores empty documents: since they do not appear in the input file, their keys receive only one instance. As empty documents are rare, the computation time is not noticeably affected.
* The two instances of a key are exactly the two documents to compare: compute their similarity and emit the pair if it is similar.
`365085` similarities are computed. Since our sampled file contains `n=855` documents, we indeed have `365085 = n*(n-1)/2`, so the algorithm works as expected.
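A minimal sketch of this naive job is shown below, reusing the `JaccardSimilarity.similarity` helper sketched earlier. The configuration key `total.lines`, the `0.8` threshold, and the class names are assumptions; the repository's actual code may differ. Keys are written in `smallerId,largerId` order so that the two copies of each pair meet at the same reducer:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NaiveSimilarity {

    public static class NaiveMapper extends Mapper<LongWritable, Text, Text, Text> {
        private long totalLines;  // total line number from the previous counter

        @Override
        protected void setup(Context context) {
            // "total.lines" is an assumed configuration key set by the driver.
            totalLines = context.getConfiguration().getLong("total.lines", 0);
        }

        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed input format: "<docId>\t<document content>".
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;
            long docId = Long.parseLong(parts[0]);
            Text document = new Text(parts[1]);

            // Emit one key per other document id, in "smallerId,largerId" form so
            // that the two members of a pair end up at the same reducer. Ids of
            // empty lines are included but will only ever receive one instance.
            for (long i = 1; i <= totalLines; i++) {
                if (i == docId) continue;
                long a = Math.min(docId, i), b = Math.max(docId, i);
                context.write(new Text(a + "," + b), document);
            }
        }
    }

    public static class NaiveReducer extends Reducer<Text, Text, Text, Text> {
        private static final double THRESHOLD = 0.8;  // assumed similarity threshold

        @Override
        protected void reduce(Text pair, Iterable<Text> docs, Context context)
                throws IOException, InterruptedException {
            String first = null, second = null;
            int count = 0;
            for (Text d : docs) {
                if (count == 0) first = d.toString();
                else if (count == 1) second = d.toString();
                count++;
            }
            // Only keys with exactly two instances correspond to two non-empty documents.
            if (count != 2) return;

            double sim = JaccardSimilarity.similarity(first, second);
            if (sim >= THRESHOLD) {
                context.write(pair, new Text(String.valueOf(sim)));
            }
        }
    }
}
```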
#### 2. Pre-filtering approach
> Create an inverted index, only for the first |d| - ⌈t|d|⌉ + 1 words of each document d (remember that they are stored in ascending order of frequency). In your reducer, compute the similarity of the document pairs. Output only similar pairs on HDFS, in TextOutputFormat. Report the execution time and the number of performed comparisons.
In this part, the implementation is more straightforward:
* In the map phase, invert ids and words: for each document `d`, emit every word in the prefix window `|d| - ⌈t|d|⌉ + 1` as a key, with the document id as value.
* Since the map phase outputs only document ids and not the document contents, a hash map for document retrieval is needed at the reduce phase. We load it in the `setup` function.
* In the reduce phase, for every key, compute similarities whenever several document ids are present. Since the words are sorted by ascending frequency, ideally far fewer comparisons are needed. A sketch of this job is given after the list.
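Below is a simplified sketch of this pre-filtering job. The configuration key `documents.path` (pointing to the sorted-lines file), the `0.8` threshold, and the class names are assumptions rather than the repository's actual code, and it reuses the `JaccardSimilarity.similarity` helper from above. The mapper indexes only the prefix of each document; the reducer compares the candidate documents that share a prefix word:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PrefilteringSimilarity {

    private static final double T = 0.8;  // assumed similarity threshold

    public static class PrefixMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed input format: "<docId>\t<words sorted by ascending frequency>".
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;
            String docId = parts[0];
            String[] words = parts[1].split("\\s+");

            // Index only the first |d| - ceil(t*|d|) + 1 words of each document.
            int prefixLength = words.length - (int) Math.ceil(T * words.length) + 1;
            for (int i = 0; i < prefixLength && i < words.length; i++) {
                context.write(new Text(words[i]), new Text(docId));
            }
        }
    }

    public static class PrefixReducer extends Reducer<Text, Text, Text, Text> {
        // Document id -> content, loaded once per reducer in setup().
        private final Map<String, String> documents = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // "documents.path" is an assumed configuration key pointing to the
            // sorted-lines file on HDFS ("<docId>\t<content>" per line).
            Path path = new Path(context.getConfiguration().get("documents.path"));
            FileSystem fs = FileSystem.get(context.getConfiguration());
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) documents.put(parts[0], parts[1]);
                }
            }
        }

        @Override
        protected void reduce(Text word, Iterable<Text> ids, Context context)
                throws IOException, InterruptedException {
            // Distinct document ids whose prefix contains this word.
            List<String> raw = new ArrayList<>();
            for (Text id : ids) raw.add(id.toString());
            List<String> candidates = new ArrayList<>(new LinkedHashSet<>(raw));

            // Compare every pair of candidates. (A pair sharing several prefix
            // words may be compared more than once; the real implementation
            // could deduplicate such pairs.)
            for (int i = 0; i < candidates.size(); i++) {
                for (int j = i + 1; j < candidates.size(); j++) {
                    String d1 = documents.get(candidates.get(i));
                    String d2 = documents.get(candidates.get(j));
                    if (d1 == null || d2 == null) continue;
                    double sim = JaccardSimilarity.similarity(d1, d2);
                    if (sim >= T) {
                        context.write(new Text(candidates.get(i) + "," + candidates.get(j)),
                                new Text(String.valueOf(sim)));
                    }
                }
            }
        }
    }
}
```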
The naive approach performs O(n²) comparisons and generates a correspondingly large amount of intermediate data, so it needs much more time, even in the shuffle and sort phase.
The pre-filtering approach is very efficient when similar documents are rare and documents are not very long, which is exactly our case. This explains the drastic performance difference.