From 6c4307b5ed63566aa952fe87b3dbf2335f6395f2 Mon Sep 17 00:00:00 2001
From: Wen Yao Jin <wen-yao.jin@student.ecp.fr>
Date: Sat, 11 Mar 2017 14:19:13 +0100
Subject: [PATCH] Update BDPA_Assign2_WJIN.md

---
 BDPA_Assign2_WJIN.md | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/BDPA_Assign2_WJIN.md b/BDPA_Assign2_WJIN.md
index 8c9c8cb..e7ada0e 100644
--- a/BDPA_Assign2_WJIN.md
+++ b/BDPA_Assign2_WJIN.md
@@ -1,5 +1,5 @@
 # Assignment 2 for BDPA
-### by Wenyao JIN
+##### by Wenyao JIN
 ---
 ### Preprocessing the input
 #### 1. Remake the stopwords file
@@ -40,7 +40,7 @@ In this step, several tasks should be done:
 	* sort them by their pre-calculated frequency 
 	* output words with their line number as key
 
-For this step, all task are done within the mapper. The tokenizer is " " or "--" as before. A set container is used to avoid duplicates. Java's build-in sort function is applied with a costumed compare function incorporating the word frequency. StringUtils's join function serves to join words together with a space.
+For this step, all tasks are done within the mapper. The tokenizer is " " or "--" as before. A set container is used to avoid duplicates. Java's built-in sort function is applied with a custom comparator based on the word frequency. StringUtils's join function joins the words with a comma. The counter reveals a total of ```124787``` lines.
 
 ```java
       public void map(LongWritable key, Text value, Context context)
@@ -70,9 +70,16 @@ For this step, all task are done within the mapper. The tokenizer is " " or "--"
         		 return  wordFreq.get(s1).compareTo(wordFreq.get(s2));
         	 }
          });
-         words.set(StringUtils.join(wordList," "));
+         words.set(StringUtils.join(wordList,","));
          context.write(new LongWritable(counter.getValue()), words);
       }
-   }
 ```
-The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/sortedline).
+The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/sortedline). The total number of lines is written to HDFS as instructed; that file can also be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/linenumber).
+
+---  
+### Set similarity joins
+
+
+
+
+
-- 
GitLab