diff --git a/README.md b/README.md
index b7a77989c944a947f8f3b8641844665100cfdc78..e8c61c587b105a595f6deaf1a95056ed603cea9f 100644
--- a/README.md
+++ b/README.md
@@ -1,45 +1,59 @@
 # Big Data Process Assignment 2 _ Meiqi GUO
 ## Pre-processing the input
+
 For the part of pre-procesing, the input consists of:
 * the document corpus of [pg100.txt](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/pg100.txt)
 * the [Stopword file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/Stopwords) which I made in the assignment 1
-* the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt that I obtained by runnning the assignment 1 with a slight changement of [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java).
+* the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt that I obtained by runnning the assignment 1 with a slight changement of [MyWordCount.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java).
+
+
+I do the following tasks in [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java):
+
-I do the following tasks in [Preprocess.java]():
 **Remove all stopwords**
 ```
 else if (stopWords.contains(word)){
     continue;
 }
 ```
+
 **Remove special characters (keep only [a-z],[A-Z] and [0-9]) and convert to lower case**
 ```
 word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase())
 ```
+
 **keep each unique word only once per line**
+
+
 We define a *hashset* where we store words
 ```
 Set<String> wordSet = new HashSet<String>();
 ```
+
 **Remove empty lines**
+
+
 I removed firstly all empty lines:
 ```
 if (value.toString().isEmpty()){
     return;
 }
 ```
+
 After removing stopwords and special characters, I removed all new empty lines:
 ```
 if (wordSet.isEmpty()){
     return;
 ```
 **Count line numbers**
+
+
 I used two counters:
 * one is to note the number of lines in the initial document, named *LineNumCounter*;
 * the other one is to record the number of lines for the output, named *FinalLineNumCounter*, which means the number after removing all empty lines.
 The result is shown as below:
-[](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counters.PNG)
+
 **Order the tokens of each line in ascending order of global frequency**
 I used the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt that I obtained by runnning the assignment 1 with a slight changement of [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java) as input and ordered tokens by their frequency.
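
The per-line steps described in this README (strip special characters, lower-case, drop stopwords, keep each word once, order by ascending global frequency) can be sketched in plain Java outside Hadoop. This is a minimal sketch under stated assumptions, not the code in Preprocess.java: the class `LinePreprocessor`, the method `preprocessLine`, the whitespace tokenization, the in-memory `wordFreq` map, the frequency of 0 for unknown words, and the alphabetical tie-break are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the repository: mirrors the per-line steps in plain Java.
final class LinePreprocessor {

    // stopWords and wordFreq are assumed to be loaded beforehand from the
    // Stopwords and wordfreq input files.
    static String preprocessLine(String line, Set<String> stopWords, Map<String, Integer> wordFreq) {
        // A set keeps each unique word only once per line.
        Set<String> wordSet = new LinkedHashSet<>();
        for (String token : line.split("\\s+")) {
            // Keep only [a-z], [A-Z] and [0-9], then convert to lower case.
            String word = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
            // Skip tokens that became empty and tokens that are stopwords.
            if (word.isEmpty() || stopWords.contains(word)) {
                continue;
            }
            wordSet.add(word);
        }
        // Order by ascending global frequency; unknown words count as 0 (assumption),
        // and ties are broken alphabetically (assumption) to keep the output deterministic.
        List<String> ordered = new ArrayList<>(wordSet);
        ordered.sort(Comparator
                .comparingInt((String w) -> wordFreq.getOrDefault(w, 0))
                .thenComparing(Comparator.naturalOrder()));
        // An empty result means the line should be dropped by the caller.
        return String.join(" ", ordered);
    }
}
```

A caller would skip any line for which `preprocessLine` returns an empty string, which corresponds to the second empty-line check above, and would increment the output line counter only for non-empty results.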