From ad38ee531647c284bee4b25afbaa9b74402e063a Mon Sep 17 00:00:00 2001
From: Wen Yao Jin <wen-yao.jin@student.ecp.fr>
Date: Sat, 11 Mar 2017 00:32:45 +0100
Subject: [PATCH] Update BDPA_Assign2_WJIN.md

---
 BDPA_Assign2_WJIN.md | 53 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/BDPA_Assign2_WJIN.md b/BDPA_Assign2_WJIN.md
index 53f2ab3..8c9c8cb 100644
--- a/BDPA_Assign2_WJIN.md
+++ b/BDPA_Assign2_WJIN.md
@@ -1,4 +1,4 @@
-stopwords# Assignment 2 for BDPA
+# Assignment 2 for BDPA
 ### by Wenyao JIN
 ---
 ### Preprocessing the input
@@ -26,4 +26,53 @@ The stop word file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_A
 #### 2. Count word frequency of pg100.txt
 By using again the wordcount algorithm, we recount the word frequency for pg100.txt to be used later for word sorting. This time capital cases are kept to be taken acount in the similarity comparison. The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/wordfreq).
 
-#### 3. Output sorted lines
\ No newline at end of file
+#### 3. Output lines
+This step performs several tasks:
+* Store all stop words in a set
+* Store all word frequencies in a hash map
+* For each line:
+	* increment a counter to track the line number
+	* skip empty lines
+	* split the line into words
+	* strip special characters
+	* drop stop words
+	* remove duplicates
+	* sort the remaining words by their pre-computed frequency, in ascending order
+	* output the words with their line number as key
+
+All of these tasks are done within the mapper. As before, tokens are split on whitespace or on runs of two or more hyphens. A set is used to eliminate duplicates. Java's built-in sort function is applied with a custom comparator that compares word frequencies. The `join` function of `StringUtils` joins the words with a single space.
+
+```java
+      public void map(LongWritable key, Text value, Context context)
+              throws IOException, InterruptedException {
+    	 Counter counter = context.getCounter(DocLineCounter.NUM);
+    	 counter.increment(1);
+    	 Set<String> wordSet = new HashSet<String>();
+    	 if (value.toString().isEmpty()){
+    		 return;
+    	 }
+         for (String token: value.toString().split("\\s+|-{2,}+")) {
+        	 String s = token.replaceAll("[^A-Za-z0-9]+", "");
+        	 if (stopWords.contains(s)||(s.isEmpty())){
+        		 continue;
+        	 }else if(!wordFreq.containsKey(s)){
+        		 System.out.println("WARN: word missing from frequency table:");
+        		 System.out.println(s);	 
+        	 }
+        	 wordSet.add(s);
+         }
+         List<String> wordList = new ArrayList<String>(wordSet);
+         
+         Collections.sort(wordList, new Comparator<String>() {
+        	 @Override
+        	 public int compare(String s1, String s2)
+        	 {
+        		 return  wordFreq.get(s1).compareTo(wordFreq.get(s2));
+        	 }
+         });
+         words.set(StringUtils.join(wordList," "));
+         context.write(new LongWritable(counter.getValue()), words);
+      }
+   }
+```
+The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/sortedline).
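
The per-line processing in the mapper above can be sketched outside Hadoop as plain Java. This is a minimal illustration, not the assignment's code: the class `LineSorter` and the method `process` are hypothetical names, `wordFreq` is assumed to map words to `Integer` counts, words missing from the table are assumed to count as frequency 0, and ties are broken alphabetically only to make the output deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper illustrating the per-line steps; not part of the assignment code.
class LineSorter {

    // Returns the cleaned, de-duplicated words of one line,
    // ordered by ascending global frequency and joined with spaces.
    static String process(String line, Set<String> stopWords, Map<String, Integer> wordFreq) {
        Set<String> wordSet = new HashSet<>();

        // Split on whitespace or on runs of two or more hyphens.
        for (String token : line.split("\\s+|-{2,}")) {
            // Keep only letters and digits.
            String s = token.replaceAll("[^A-Za-z0-9]+", "");
            if (s.isEmpty() || stopWords.contains(s)) {
                continue;
            }
            wordSet.add(s); // the set removes duplicates
        }

        List<String> wordList = new ArrayList<>(wordSet);

        // Assumption: a word absent from the table is treated as frequency 0,
        // and equal frequencies fall back to alphabetical order.
        wordList.sort(Comparator
                .comparingInt((String w) -> wordFreq.getOrDefault(w, 0))
                .thenComparing(Comparator.naturalOrder()));

        return String.join(" ", wordList);
    }
}
```

With the stop word `the` and the made-up frequencies `The=1, brown=2, fox=2, quick=5, dog=9`, calling `LineSorter.process("The quick--brown fox, the quick dog!", stopWords, wordFreq)` returns `The brown fox quick dog`. The check is case-sensitive, so `The` survives the stop word filter while `the` is dropped.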
-- 
GitLab