Update README.md

Commit b93032f9 authored 8 years ago by Meiqi Guo
--- a/README.md
+++ b/README.md
@@ -10,19 +10,19 @@ For the part of pre-processing, the input consists of:
 I do the following tasks in [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java):


-**Remove all stopwords**
+**STEP 1: Remove all stopwords**
 ```
 else if (stopWords.contains(word)) {
     continue;
 }
 ```

-**Remove special characters (keep only [a-z],[A-Z] and [0-9]) and convert to lower case**
+**STEP 2: Remove special characters (keep only [a-z],[A-Z] and [0-9]) and convert to lower case**
 ```
 word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase());
 ```

-**Keep each unique word only once per line**
+**STEP 3: Keep each unique word only once per line**


 We define a *hashset* where we store words
@@ -30,7 +30,7 @@ We define a *hashset* where we store words
 Set<String> wordSet = new HashSet<String>();
 ```
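
Steps 1 to 3 can be combined into one per-line routine. The following is only a minimal standalone sketch, not the code from Preprocess.java: the class name `LinePreprocessSketch`, the method `cleanLine`, and the tiny stop-word list are hypothetical, and it assumes tokens are cleaned before the stop-word check. A `LinkedHashSet` is used instead of a plain `HashSet` only so that the printed order is predictable.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class LinePreprocessSketch {

    // Hypothetical stop-word list; the real job loads its own list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and"));

    // Cleans one line: strips non-alphanumeric characters, lower-cases,
    // drops stop words, and keeps each remaining word only once.
    static Set<String> cleanLine(String line) {
        Set<String> wordSet = new LinkedHashSet<>();
        for (String token : line.split("\\s+")) {
            String word = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue; // nothing left after cleaning, or a stop word
            }
            wordSet.add(word); // a set silently ignores duplicates
        }
        return wordSet;
    }

    public static void main(String[] args) {
        // prints [king, crown]
        System.out.println(cleanLine("The King, the KING and a crown!"));
    }
}
```

A line whose tokens are all stop words or punctuation yields an empty set, which is the case the empty-line check in the next step relies on.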

-**Remove empty lines**
+**STEP 4: Remove empty lines**


 First, I removed all empty lines:
@@ -45,7 +45,7 @@ After removing stopwords and special characters, I removed all new empty lines:
 if (wordSet.isEmpty()){
     return;
 }
 ```
-**Count line numbers**
+**STEP 5: Count line numbers**


 I used two counters:
@@ -64,7 +64,7 @@ So nearly 10000 lines are empty.
 ![](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counters.PNG)


-**Order the tokens of each line in ascending order of global frequency**
+**STEP 6: Order the tokens of each line in ascending order of global frequency**

 I used the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt, which I obtained by running Assignment 1 with a slightly modified [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java), as input and ordered the tokens by their frequency.
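
As a rough illustration of the ordering idea, independent of the README's own snippet, the sketch below sorts the words of one line by ascending global frequency. The class `FrequencyOrderSketch`, the method `orderByFrequency`, and the frequency values are hypothetical; breaking ties alphabetically and treating unknown words as frequency 0 are assumptions, not something the README states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FrequencyOrderSketch {

    // Returns a copy of the words sorted by ascending global frequency.
    // Words missing from the map count as 0; ties are broken alphabetically.
    static List<String> orderByFrequency(List<String> words, Map<String, Integer> freq) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Comparator
                .comparingInt((String w) -> freq.getOrDefault(w, 0))
                .thenComparing(Comparator.naturalOrder()));
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up frequencies standing in for the contents of the wordfreq file.
        Map<String, Integer> freq = new HashMap<>();
        freq.put("king", 120);
        freq.put("crown", 15);
        freq.put("sword", 15);
        // prints [crown, sword, king]
        System.out.println(orderByFrequency(Arrays.asList("king", "sword", "crown"), freq));
    }
}
```

The README's own snippet for this step follows.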
 ```