Update README.md

Commit 6d32b837 authored Mar 17, 2017 by Meiqi Guo
--- a/README.md
+++ b/README.md
 # Big Data Process Assignment 2 _ Meiqi GUO
 ## Pre-processing the input
+
 For the pre-processing step, the input consists of:
 * the document corpus of [pg100.txt](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/pg100.txt)
 * the [Stopword file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/Stopwords) that I created in Assignment 1
-* the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt, which I obtained by running Assignment 1 with a slightly modified [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java).
+* the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt, which I obtained by running Assignment 1 with a slightly modified [MyWordCount.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java).
+
+
+I do the following tasks in [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java):
+

-I do the following tasks in [Preprocess.java]():
 **Remove all stopwords**
 ```
 else if (stopWords.contains(word)){
     continue;  // skip stop words
 }
 ```
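 
 The fragment above is only one branch of the loop in Preprocess.java. As a rough, self-contained sketch of the same idea (the class name `StopWordFilter`, the method `removeStopWords`, and the sample words are hypothetical and not taken from the repository), the stop words can be held in a set and every token checked against it:
 
 ```java
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Set;
 
 class StopWordFilter {
     // Returns the tokens that are not in the stop-word set, keeping their order.
     static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
         List<String> kept = new ArrayList<>();
         for (String token : tokens) {
             if (stopWords.contains(token)) {
                 continue; // same check as in the fragment above
             }
             kept.add(token);
         }
         return kept;
     }
 
     public static void main(String[] args) {
         // Hypothetical sample; the real list comes from the Stopwords file.
         Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of", "and"));
         List<String> tokens = Arrays.asList("the", "tragedy", "of", "hamlet");
         System.out.println(removeStopWords(tokens, stopWords)); // prints [tragedy, hamlet]
     }
 }
 ```
 
 A `HashSet` keeps each lookup close to constant time, which matters because the check runs once per token.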
+
 **Remove special characters (keep only [a-z],[A-Z] and [0-9]) and convert to lower case**
 ```
 word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase());
 ```
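 
 To illustrate what this expression does to individual tokens, here is a small sketch that applies the same regular expression outside Hadoop (the class `TokenNormalizer` and the method `normalize` are hypothetical names used only for this example):
 
 ```java
 class TokenNormalizer {
     // Drops every character outside [A-Za-z0-9], then lower-cases the rest.
     static String normalize(String token) {
         return token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
     }
 
     public static void main(String[] args) {
         System.out.println(normalize("Hamlet's"));     // prints hamlets
         System.out.println(normalize("King-Lear,"));   // prints kinglear
         System.out.println(normalize("--").isEmpty()); // prints true
     }
 }
 ```
 
 Two consequences are worth noting: hyphenated words are merged into one token, and a token made only of punctuation becomes an empty string, so it has to be skipped before it is stored.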
+
 **Keep each unique word only once per line**
+
+
 We define a *HashSet* that stores the words of the current line; because a set cannot hold duplicates, each word is kept only once:
 ```
 Set<String> wordSet = new HashSet<String>();
 ```
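 
 A minimal sketch of how such a set removes duplicates within one line is shown below. Splitting on whitespace is an assumption made for the example, and `UniqueWords` and `uniqueWords` are hypothetical names; Preprocess.java may tokenize differently.
 
 ```java
 import java.util.HashSet;
 import java.util.Set;
 
 class UniqueWords {
     // Collects the distinct whitespace-separated words of one line.
     static Set<String> uniqueWords(String line) {
         Set<String> wordSet = new HashSet<String>();
         for (String token : line.split("\\s+")) {
             if (!token.isEmpty()) {
                 wordSet.add(token); // adding an existing word has no effect
             }
         }
         return wordSet;
     }
 
     public static void main(String[] args) {
         // "to" and "be" appear twice but are stored once.
         System.out.println(uniqueWords("to be or not to be").size()); // prints 4
     }
 }
 ```
 
 A `HashSet` does not preserve insertion order, which is acceptable here because the tokens are reordered by frequency afterwards.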
+
 **Remove empty lines**
+
+
 First, I removed all lines that are empty in the input:
 ```
 if (value.toString().isEmpty()){
     return;  // skip lines that are empty in the input
 }
 ```
+
 After removing stopwords and special characters, I also removed the lines that had become empty:
 ```
 if (wordSet.isEmpty()){
     return;  // nothing left after filtering, so emit no output for this line
 }
 ```
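 
 The second check is needed because a line can be non-empty in the input and still contain nothing useful. The sketch below combines the two checks under simplifying assumptions (whitespace tokenization, hypothetical names `EmptyLineCheck` and `isDropped`, a made-up stop-word sample); it is not the mapper from Preprocess.java.
 
 ```java
 import java.util.Arrays;
 import java.util.HashSet;
 import java.util.Set;
 
 class EmptyLineCheck {
     // Returns true when the line would produce no output record.
     static boolean isDropped(String line, Set<String> stopWords) {
         if (line.isEmpty()) {
             return true; // first check: empty in the input
         }
         Set<String> wordSet = new HashSet<>();
         for (String token : line.split("\\s+")) {
             String word = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
             if (word.isEmpty() || stopWords.contains(word)) {
                 continue;
             }
             wordSet.add(word);
         }
         return wordSet.isEmpty(); // second check: empty after filtering
     }
 
     public static void main(String[] args) {
         Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of")); // hypothetical sample
         System.out.println(isDropped("The -- of", stopWords)); // prints true
         System.out.println(isDropped("The king", stopWords));  // prints false
     }
 }
 ```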
 **Count line numbers**
+
+
 I used two counters:
 * *LineNumCounter* records the number of lines in the initial document;
 * *FinalLineNumCounter* records the number of lines in the output, that is, the number remaining after all empty lines are removed.

 The result is shown below:
-[](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counters.PNG)
+![](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counters.PNG)
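 
 The counting logic can be sketched in plain Java as follows. This is a stand-in that runs without Hadoop: the enum `Counters`, the class `LineCounting`, and the `trim().isEmpty()` filter are assumptions for illustration. In a Hadoop mapper, the corresponding call would be `context.getCounter(...).increment(1)` on a counter defined by an enum.
 
 ```java
 import java.util.EnumMap;
 import java.util.Map;
 
 class LineCounting {
     // Hypothetical declaration; the counter names follow the README.
     enum Counters { LineNumCounter, FinalLineNumCounter }
 
     static Map<Counters, Long> count(String[] lines) {
         Map<Counters, Long> counters = new EnumMap<>(Counters.class);
         counters.put(Counters.LineNumCounter, 0L);
         counters.put(Counters.FinalLineNumCounter, 0L);
         for (String line : lines) {
             // Every input line is counted, empty or not.
             counters.merge(Counters.LineNumCounter, 1L, Long::sum);
             if (line.trim().isEmpty()) {
                 continue; // stand-in for the real empty-line filter
             }
             // Only lines that reach the output are counted here.
             counters.merge(Counters.FinalLineNumCounter, 1L, Long::sum);
         }
         return counters;
     }
 
     public static void main(String[] args) {
         String[] lines = { "first line", "", "second line" };
         System.out.println(count(lines)); // prints {LineNumCounter=3, FinalLineNumCounter=2}
     }
 }
 ```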

 **Order the tokens of each line in ascending order of global frequency**
 I used the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt, which I obtained by running Assignment 1 with a slightly modified [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java), as input and ordered the tokens of each line by their global frequency.
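 
 One possible way to do this ordering is sketched below. It assumes the frequency file has already been loaded into a `Map<String, Integer>`; the names `FrequencyOrder` and `orderByFrequency`, the alphabetical tie-break, and the sample frequencies are all hypothetical and not taken from Preprocess.java.
 
 ```java
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Comparator;
 import java.util.HashMap;
 import java.util.HashSet;
 import java.util.List;
 import java.util.Map;
 import java.util.Set;
 
 class FrequencyOrder {
     // Sorts the words of one line from rarest to most frequent.
     static List<String> orderByFrequency(Set<String> words, Map<String, Integer> freq) {
         List<String> ordered = new ArrayList<>(words);
         ordered.sort(Comparator
                 .comparingInt((String w) -> freq.getOrDefault(w, 0))
                 .thenComparing(Comparator.<String>naturalOrder())); // assumed tie-break
         return ordered;
     }
 
     public static void main(String[] args) {
         Map<String, Integer> freq = new HashMap<>(); // made-up frequencies
         freq.put("hamlet", 5);
         freq.put("king", 20);
         freq.put("love", 100);
         Set<String> words = new HashSet<>(Arrays.asList("love", "hamlet", "king"));
         System.out.println(orderByFrequency(words, freq)); // prints [hamlet, king, love]
     }
 }
 ```
 
 The tie-break makes the output deterministic when two words have the same frequency, which a `HashSet` alone would not guarantee.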