Update README.md

5d627b17 · Meiqi Guo · 5bdc284a · 5d627b17
Commit 5d627b17 authored Mar 18, 2017 by Meiqi Guo
--- a/README.md
+++ b/README.md
-# Big Data Process Assignment 2 _ Meiqi GUO
-## Pre-processing the input
-
-For the pre-processing step, the input consists of:
-* the document corpus of [pg100.txt](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/pg100.txt)
-* the [Stopword file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/Stopwords) that I made in Assignment 1
-* the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt, which I obtained by running Assignment 1 with a slightly modified [MyWordCount.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java).
-
-
-I perform the following steps in [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java):
-
-
-**STEP 1: Remove all stopwords**
-```
-else if (stopWords.contains(word)) {
-    continue; // skip stopwords
-}
-```
-
-**STEP 2: Remove special characters (keep only [a-z],[A-Z] and [0-9]) and convert to lower case**
-```
-word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase());
-```
-
-**STEP 3: Keep each unique word only once per line**
-
-
-We define a *HashSet* in which we store the words of the current line, so that duplicate words are dropped automatically:
-```
-Set<String> wordSet = new HashSet<String>();
-```
-
-**STEP 4: Remove empty lines**
-
-
-First, I skipped all lines that were empty in the input:
-```
-if (value.toString().isEmpty()) {
-    return; // nothing to emit for an empty input line
-}
-```
-
-After removing stopwords and special characters, I skipped all lines that had become empty:
-```
-if (wordSet.isEmpty()) {
-    return; // the line contained only stopwords or special characters
-}
-```
-**STEP 5: Count line numbers**
-
-
-I used two counters:
-* *LineNumCounter* records the number of lines in the initial document;
-* *FinalLineNumCounter* records the number of lines in the output, that is, the number remaining after all empty lines are removed.
-
-The results are shown below:
-
-NUM = 124787
-
-Final_NUM = 114815
-
-So 9972 lines (nearly 10000) were empty, either originally or after removing stopwords and special characters.
-
-
-![](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counters.PNG)
-
-
-**STEP 6: Order the tokens of each line in ascending order of global frequency**
-
-I used the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt, obtained by running Assignment 1 with a slightly modified [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java), as input and ordered the tokens of each line by their global frequency.
-```
-Collections.sort(wordList, new Comparator<String>() {
-    @Override
-    public int compare(String s1, String s2) {
-        // ascending order of global frequency
-        return wordFreq.get(s1).compareTo(wordFreq.get(s2));
-    }
-});
-```
-
-You can see the output file [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess).
-
-
-The full details are in my code, [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java).
-
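
For reference, the six steps described in the removed README can be put together in a minimal sketch. It is plain Java with no Hadoop dependency, and it is not the author's `Preprocess.java`: the class name `PreprocessSketch` and the method `preprocessLine` are hypothetical. It also makes three assumptions: tokens are cleaned before the stopword check, words missing from the frequency map count as frequency 0, and ties keep their first-occurrence order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper class for illustration; not part of the original repository.
class PreprocessSketch {

    // Returns the cleaned, de-duplicated tokens of one line in ascending order
    // of global frequency. An empty list means the line should be dropped.
    static List<String> preprocessLine(String line,
                                       Set<String> stopWords,
                                       final Map<String, Integer> wordFreq) {
        // STEP 3: a set keeps each word once; LinkedHashSet preserves first-seen order.
        Set<String> wordSet = new LinkedHashSet<String>();

        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            // STEP 2: strip special characters and convert to lower case.
            String word = tokens.nextToken()
                                .replaceAll("[^A-Za-z0-9]+", "")
                                .toLowerCase();
            // STEP 1: skip stopwords (and tokens that became empty after cleaning).
            if (word.isEmpty() || stopWords.contains(word)) {
                continue;
            }
            wordSet.add(word);
        }

        // STEP 4: an empty result signals that the line is empty after cleaning.
        List<String> wordList = new ArrayList<String>(wordSet);

        // STEP 6: sort by ascending global frequency (the sort is stable).
        Collections.sort(wordList, new Comparator<String>() {
            @Override
            public int compare(String s1, String s2) {
                // Assumption: unknown words are treated as frequency 0.
                int f1 = wordFreq.containsKey(s1) ? wordFreq.get(s1) : 0;
                int f2 = wordFreq.containsKey(s2) ? wordFreq.get(s2) : 0;
                return Integer.compare(f1, f2);
            }
        });
        return wordList;
    }
}
```

In a MapReduce mapper, the caller would increment the input-line counter for every record and the final counter only when the returned list is non-empty (STEP 5).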