diff --git a/BDPA_Assign2_WJIN.md b/BDPA_Assign2_WJIN.md
index 8c9c8cb3139d6e8a74246cbb56f777d981bf953e..70598e3aee4fd063d626b9abb65617c4599e0326 100644
--- a/BDPA_Assign2_WJIN.md
+++ b/BDPA_Assign2_WJIN.md
@@ -1,14 +1,14 @@
 # Assignment 2 for BDPA
-### by Wenyao JIN
+##### by Wenyao JIN
 ---
 ### Preprocessing the input
 #### 1. Remake the stopwords file
 By slightly modifying the wordcount code from the previous assignment, we can output a stopwords file. 
 * take all three input files as before
-* use space or "--" as tokenizer 
+* use `space` or `--` as tokenizer 
 * filter out all characters besides letters and numbers
 * transform all words to lower case
-* output is only count larger than 4000
+* output only the words whose count is larger than `4000`
 
 ```java
 
@@ -24,7 +24,7 @@ By slightly modifying the wordcount code from the previous assignment, we can ou
 The stop word file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/stopwords).
 
 #### 2. Count word frequency of pg100.txt
-By using again the wordcount algorithm, we recount the word frequency for pg100.txt to be used later for word sorting. This time capital cases are kept to be taken acount in the similarity comparison. The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/wordfreq).
+Using the wordcount algorithm again, we recount the word frequency of `pg100.txt`, to be used later for word sorting. This time case is preserved, so that capitalization is taken into account in the similarity comparison. The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/wordfreq).
 
 #### 3. Output lines
 In this step, several tasks should be done:
@@ -40,7 +40,7 @@ In this step, several tasks should be done:
 	* sort them by their pre-calculated frequency 
 	* output words with their line number as key
 
-For this step, all task are done within the mapper. The tokenizer is " " or "--" as before. A set container is used to avoid duplicates. Java's build-in sort function is applied with a costumed compare function incorporating the word frequency. StringUtils's join function serves to join words together with a space.
+For this step, all tasks are done within the mapper. The tokenizer is `space` or `--` as before. A set container is used to avoid duplicates. Java's built-in sort function is applied with a custom compare function based on the word frequency. StringUtils's `join` function joins the words with a `comma`. The counter reports a total of `124787` lines.
 
 ```java
       public void map(LongWritable key, Text value, Context context)
@@ -70,9 +70,161 @@ For this step, all task are done within the mapper. The tokenizer is " " or "--"
         		 return  wordFreq.get(s1).compareTo(wordFreq.get(s2));
         	 }
          });
-         words.set(StringUtils.join(wordList," "));
+         words.set(StringUtils.join(wordList,","));
          context.write(new LongWritable(counter.getValue()), words);
       }
-   }
 ```
-The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/sortedline).
+The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/sortedline). The total number of lines is written to HDFS as instructed; the file can also be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/linenumber).
+
+---  
+### Set similarity joins
+#### 0. Primary implementation
+In this part, we need to compare the pairwise similarity of lines. Before implementing the two approaches, several basic modules have to be built.
+##### _Key Pair_
+In this MapReduce program, the keys emitted by the mappers are pairs of line ids. Thus, a new key pair class (in our case a pair of `LongWritable`s) is needed.
+
+Several remarks and the intuition behind them:
+* `LongPair` needs to implement the `WritableComparable` interface so that it can be shuffled and sorted.
+* Override `equals`: the order within a pair should not be taken into account when checking whether two pairs are equal (for example, `(A,B)` should equal `(B,A)`). So the function also checks the inverted pair before returning `false`.
+* Override `compareTo`: the ordering itself matters little, but it has to be consistent with `equals`. Here I propose comparing pairs by a `sum` value (the sum of the two ids) and a `difference` value (the absolute difference of the two ids). Two pairs are equal if and only if both their sums and their differences are equal.
+
+```java
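+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.WritableComparable;
+
+class LongPair implements WritableComparable<LongPair> {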
+
+    private LongWritable first;
+    private LongWritable second;
+    
+    public LongPair() {
+        this.set(new LongWritable(0), new LongWritable(0));
+    }
+    
+    public LongPair(LongWritable first, LongWritable second) {
+        this.set(first, second);
+    }
+    
+    public LongPair(Long first, Long second) {
+        this.set(new LongWritable(first), new LongWritable(second));
+    }
+
+    public LongPair(String first, String second) {
+        // Long.parseLong avoids the deprecated new Long(String) constructor
+        this.set(new LongWritable(Long.parseLong(first)), new LongWritable(Long.parseLong(second)));
+    }
+
+    public LongWritable getFirst() {
+        return first;
+    }
+
+    public LongWritable getSecond() {
+        return second;
+    }
+
+    public void set(LongWritable first, LongWritable second) {
+        this.first = first;
+        this.second = second;
+    }    
+    
+    public void setFirst(LongWritable first){
+        this.first = first;
+    }
+    
+    public void setFirst(Long first){
+        this.first = new LongWritable(first);
+    }
+    
+    public void setSecond(LongWritable second){
+        this.second = second;
+    }
+    
+    public void setSecond(Long second){
+        this.second = new LongWritable(second);
+    }
+    
+    public long getSum(){
+    	return this.first.get()+this.second.get();
+    }
+    
+    public long getDiff(){
+    	return Math.abs(this.first.get()-this.second.get());
+    }
+    
+    public LongPair inverse(){
+    	return new LongPair(second, first);
+    }
+
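+    @Override
+    public boolean equals(Object o) {
+        if (!(o instanceof LongPair)) {
+            return false;
+        }
+        LongPair p = (LongPair) o;
+        // Order inside the pair is ignored: (A,B) equals (B,A).
+        boolean sameOrder = first.equals(p.first) && second.equals(p.second);
+        boolean swapped = first.equals(p.second) && second.equals(p.first);
+        return sameOrder || swapped;
+    }
+
+    // Symmetric in first/second so that equal pairs hash alike
+    // (Hadoop's default HashPartitioner relies on hashCode).
+    @Override
+    public int hashCode() {
+        return 31 * Long.valueOf(getSum()).hashCode() + Long.valueOf(getDiff()).hashCode();
+    }
+
+    @Override
+    public int compareTo(LongPair other) {
+        // Descending by sum, then descending by absolute difference;
+        // Long.compare avoids overflow from subtracting longs.
+        int bySum = Long.compare(other.getSum(), this.getSum());
+        if (bySum != 0) {
+            return bySum;
+        }
+        return Long.compare(other.getDiff(), this.getDiff());
+    }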
+    
+
+    @Override
+    public void readFields(DataInput in) throws IOException {
+        first.readFields(in);
+        second.readFields(in);
+    }
+
+    @Override
+    public void write(DataOutput out) throws IOException {
+        first.write(out);
+        second.write(out);
+    }
+
+    @Override
+    public String toString() {
+        return first.toString() + "," + second.toString();
+    }
+}
+```
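+
+The contract described above (order-insensitive `equals`, and a `compareTo` that returns 0 exactly when `equals` is true) can be checked outside Hadoop with a small plain-Java sketch. `UnorderedPair` is a hypothetical, simplified stand-in for `LongPair` using primitive `long` fields; it sorts in ascending order, which is an arbitrary choice. The `hashCode` is an assumption added here, since hash-based collections and Hadoop's default hash partitioner expect equal keys to hash alike.
+
+```java
+import java.util.HashSet;
+import java.util.Set;
+
+public class UnorderedPairSketch {
+
+    // Hypothetical stand-in for LongPair, without the Hadoop types.
+    static final class UnorderedPair implements Comparable<UnorderedPair> {
+        final long first;
+        final long second;
+
+        UnorderedPair(long first, long second) {
+            this.first = first;
+            this.second = second;
+        }
+
+        long sum() { return first + second; }
+        long diff() { return Math.abs(first - second); }
+
+        @Override
+        public boolean equals(Object o) {
+            if (!(o instanceof UnorderedPair)) return false;
+            UnorderedPair p = (UnorderedPair) o;
+            return (first == p.first && second == p.second)
+                || (first == p.second && second == p.first);
+        }
+
+        // Built from sum and diff only, so (a,b) and (b,a) hash alike.
+        @Override
+        public int hashCode() {
+            return 31 * Long.valueOf(sum()).hashCode() + Long.valueOf(diff()).hashCode();
+        }
+
+        // Same sum and same absolute difference means the same unordered pair.
+        @Override
+        public int compareTo(UnorderedPair other) {
+            int bySum = Long.compare(sum(), other.sum());
+            return bySum != 0 ? bySum : Long.compare(diff(), other.diff());
+        }
+    }
+
+    public static void main(String[] args) {
+        Set<UnorderedPair> seen = new HashSet<UnorderedPair>();
+        seen.add(new UnorderedPair(3, 7));
+        seen.add(new UnorderedPair(7, 3)); // duplicate of (3,7)
+        seen.add(new UnorderedPair(4, 6)); // same sum, different diff
+        System.out.println(seen.size()); // prints 2
+    }
+}
+```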
+
+##### _Similarity_
+To compute the similarity of two strings as instructed, we use a `Set` to store the words. A set automatically ignores duplicates, which makes the union and intersection quick to compute. 
+```java
+	   public double similarity(String t1, String t2) {
+
+		   Set<String> s1 = text2Set(t1);
+		   Set<String> s2 = text2Set(t2);
+		   
+		   Set<String> union = new HashSet<String>(s1);
+		   union.addAll(s2);
+		   
+		   Set<String> intersection = new HashSet<String>(s1);
+		   intersection.retainAll(s2);
+		   
+		   if (union.size()==0){
+			   return 0;
+		   }
+		   
+		   // cast to double: int/int division would truncate to 0 or 1
+		   return (double) intersection.size()/union.size();
+	    }
+```
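+
+The helper `text2Set` is not shown in this section. As a rough, self-contained illustration of the same intersection-over-union computation, the sketch below uses a hypothetical `text2Set` that simply splits on commas (matching the comma-joined output of the preprocessing step); the actual helper may differ.
+
+```java
+import java.util.HashSet;
+import java.util.Set;
+
+public class JaccardSketch {
+
+    // Hypothetical stand-in for text2Set: split on commas, drop empty tokens.
+    static Set<String> text2Set(String text) {
+        Set<String> words = new HashSet<String>();
+        for (String w : text.split(",")) {
+            if (!w.isEmpty()) {
+                words.add(w);
+            }
+        }
+        return words;
+    }
+
+    // Size of the intersection divided by size of the union.
+    static double similarity(String t1, String t2) {
+        Set<String> s1 = text2Set(t1);
+        Set<String> s2 = text2Set(t2);
+
+        Set<String> union = new HashSet<String>(s1);
+        union.addAll(s2);
+        if (union.isEmpty()) {
+            return 0.0;
+        }
+
+        Set<String> intersection = new HashSet<String>(s1);
+        intersection.retainAll(s2);
+
+        return (double) intersection.size() / union.size();
+    }
+
+    public static void main(String[] args) {
+        // 2 shared words (b, c) out of 4 distinct words (a, b, c, d)
+        System.out.println(similarity("a,b,c", "b,c,d")); // prints 0.5
+    }
+}
+```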
+
+#### 1. Naive approach
+
+
+
+
+
+
+
diff --git a/img/.gitkeep b/img/.gitkeep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/img/naiveapproach.png b/img/naiveapproach.png
new file mode 100644
index 0000000000000000000000000000000000000000..3fb40e06c1cf627e1f9490c8edfc7e135bfd84f2
Binary files /dev/null and b/img/naiveapproach.png differ