From 30634e04e87c95c5fd309813dea4dd5b1740700e Mon Sep 17 00:00:00 2001
From: Meiqi Guo <mei-qi.guo@student.ecp.fr>
Date: Fri, 17 Mar 2017 02:51:30 +0100
Subject: [PATCH] Update README.md

---
 README.md | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 0ed83fe..b7a7798 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,60 @@
-# Big Data Process Assignment 2 
+# Big Data Process Assignment 2 - Meiqi GUO
 ## Pre-processing the input
For the pre-processing part, the input consists of:
 * the document corpus of [pg100.txt](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/pg100.txt)
* the [Stopword file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/Stopwords) which I made in Assignment 1
-* the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt that I obtained by runnning the assignment
-1 with a slight changement of [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java).
+* the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt that I obtained by running Assignment 1 with a slightly modified [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java).
 
+I perform the following tasks in [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java):
+**Remove all stopwords**
+```
+else if (stopWords.contains(word)) {
+    continue;
+}
+```
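+To make this step self-contained, here is a minimal sketch of how the stopword set could be filled before that check runs; reading the file with a plain *BufferedReader* and the local path "Stopwords" are assumptions for illustration, not necessarily how Preprocess.java ships the file to the mappers:
+```
+// Illustrative sketch (assumed file path and reader):
+// load each stopword into the set consulted by the check above.
+Set<String> stopWords = new HashSet<String>();
+BufferedReader reader = new BufferedReader(new FileReader("Stopwords"));
+String stop;
+while ((stop = reader.readLine()) != null) {
+    stopWords.add(stop.trim().toLowerCase());
+}
+reader.close();
+```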
+**Remove special characters (keep only [a-z], [A-Z] and [0-9]) and convert to lower case**
+```
+word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase());
+```
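+For example (a small illustration, not a line of Preprocess.java), a raw token is cleaned like this:
+```
+String token = "KING.";
+String cleaned = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
+// cleaned is now "king"; a token made only of special characters becomes ""
+```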
+**Keep each unique word only once per line**
+We define a *HashSet* in which we store the words of the current line:
+```
+Set<String> wordSet = new HashSet<String>();
+```
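+To show the idea (field names such as *line* are assumptions here, not the exact code), each cleaned token of a line is added to this set, so a word repeated on the same line is kept only once:
+```
+// Duplicates are dropped automatically: add() has no effect
+// on a word that is already in the set for this line.
+for (String token : line.split("\\s+")) {
+    String cleaned = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
+    if (!cleaned.isEmpty() && !stopWords.contains(cleaned)) {
+        wordSet.add(cleaned);
+    }
+}
+```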
+**Remove empty lines**
+I first removed all empty lines of the input:
+```
+if (value.toString().isEmpty()) {
+    return;
+}
+```
+After removing stopwords and special characters, I removed all lines that had become empty:
+```
+if (wordSet.isEmpty()) {
+    return;
+}
+```
+**Count line numbers**
+I used two counters (a sketch of how they might be incremented is shown below):
+* one, named *LineNumCounter*, records the number of lines in the initial document;
+* the other, named *FinalLineNumCounter*, records the number of lines in the output, i.e. after all empty lines have been removed.
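+A sketch of how these two counters might be incremented inside the map() function; the counter names are those described above, but grouping them into an enum is my own illustration and the real code in Preprocess.java may differ:
+```
+// Hypothetical counter enum for illustration.
+public static enum MyCounters { LineNumCounter, FinalLineNumCounter }
+
+// In map(): every input line is counted ...
+context.getCounter(MyCounters.LineNumCounter).increment(1);
+// ... and only lines that are still non-empty after cleaning reach this point:
+context.getCounter(MyCounters.FinalLineNumCounter).increment(1);
+```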
 
+The result is shown below:
+![counters](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counters.PNG)
 
+**Order the tokens of each line in ascending order of global frequency**
+As an additional input, I used the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt, obtained by running Assignment 1 with a slightly modified [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java), and sorted the tokens of each line by their global frequency.
+```
+Collections.sort(wordList, new Comparator<String>() {
+    @Override
+    public int compare(String s1, String s2) {
+        return wordFreq.get(s1).compareTo(wordFreq.get(s2));
+    }
+});
+```
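+The *wordFreq* map used by this comparator could be filled by parsing the frequency file once, for instance before the sort; the sketch below assumes each line of that file is a word followed by its count separated by whitespace, which may differ from the actual format:
+```
+// Assumed format of the frequency file: "<word> <count>" per line.
+Map<String, Integer> wordFreq = new HashMap<String, Integer>();
+BufferedReader freqReader = new BufferedReader(new FileReader("wordfreq"));
+String entry;
+while ((entry = freqReader.readLine()) != null) {
+    String[] parts = entry.split("\\s+");
+    if (parts.length == 2) {
+        wordFreq.put(parts[0], Integer.parseInt(parts[1]));
+    }
+}
+freqReader.close();
+```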
 
-All the details are written in my code Process.java.
+You can see the output file [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess).
+
+
+All the details can be found in my code [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java).
 
-- 
GitLab