diff --git a/README.md b/README.md
index 85a5e27e4532c233cb376815360255d8a3010973..355ae4f37887f2490db8dadce312db41c5758d0a 100644
--- a/README.md
+++ b/README.md
@@ -10,19 +10,19 @@ For the part of pre-processing, the input consists of:
 
 I do the following tasks in [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java):
 
-**Remove all stopwords**
+**STEP 1: Remove all stopwords**
 
 ```
 else if (stopWords.contains(word)){
     continue;
 }
 ```
 
-**Remove special characters (keep only [a-z], [A-Z] and [0-9]) and convert to lower case**
+**STEP 2: Remove special characters (keep only [a-z], [A-Z] and [0-9]) and convert to lower case**
 
 ```
 word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase());
 ```
 
-**Keep each unique word only once per line**
+**STEP 3: Keep each unique word only once per line**
 
 We define a *hashset* where we store words
 
@@ -30,7 +30,7 @@ We define a *hashset* where we store words
 Set<String> wordSet = new HashSet<String>();
 ```
 
-**Remove empty lines**
+**STEP 4: Remove empty lines**
 
 First, I removed all empty lines:
 
@@ -45,7 +45,7 @@ After removing stopwords and special characters, I removed all new empty lines:
 if (wordSet.isEmpty()){
     return;
 ```
-**Count line numbers**
+**STEP 5: Count line numbers**
 
 I used two counters:
 
@@ -64,7 +64,7 @@ So nearly 10000 lines are empty.
 
-**Order the tokens of each line in ascending order of global frequency**
+**STEP 6: Order the tokens of each line in ascending order of global frequency**
 
 As input I used the [words-with-frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) for pg100.txt, which I obtained by running assignment 1 with a slightly modified [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java), and I ordered the tokens by their frequency.
 ```
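Steps 1 through 4 of the diff above can be sketched together in plain Java. This is not the author's `Preprocess.java` (where the same logic lives inside a Hadoop `Mapper` and an empty line is skipped with `return`); the class name, the tiny stopword list, and the use of `LinkedHashSet` are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// A plain-Java sketch of preprocessing steps 1-4 (hypothetical class name).
public class PreprocessSketch {

    // Assumption: a small illustrative stopword list, not the real one.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "and", "of", "to"));

    static Set<String> preprocessLine(String line) {
        // STEP 3: a set keeps each unique word only once per line
        // (LinkedHashSet also preserves first-seen order).
        Set<String> wordSet = new LinkedHashSet<>();
        for (String token : line.split("\\s+")) {
            // STEP 2: keep only [A-Za-z0-9] and convert to lower case.
            String word = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
            if (word.isEmpty()) {
                continue; // token was pure punctuation
            }
            if (STOP_WORDS.contains(word)) {
                continue; // STEP 1: drop stopwords
            }
            wordSet.add(word);
        }
        // STEP 4: an empty result means the line became empty;
        // the caller can skip it (the mapper returns without emitting).
        return wordSet;
    }
}
```

Duplicates collapse because `Set.add` ignores elements already present, which matches the `wordSet` shown in the diff.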
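Step 5 ("I used two counters") can be sketched as follows. In the real Hadoop job the counters would presumably be `context.getCounter(...)` counters read from the job's output; here two plain `long` fields stand in for them, and the class name is hypothetical.

```java
// A sketch of STEP 5: counting all lines vs. empty lines with two counters.
public class LineCounterSketch {
    long totalLines = 0;
    long emptyLines = 0;

    // Called once per input line; increments the matching counter(s).
    void observe(String line) {
        totalLines++;
        if (line.trim().isEmpty()) {
            emptyLines++;
        }
    }
}
```

Comparing the two counters after the run is what supports the README's observation that nearly 10000 input lines are empty.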
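Step 6 (ordering each line's tokens by ascending global frequency) can be sketched like this. The real job reads the frequencies from the `wordfreq` file produced by assignment 1; in this sketch the frequency table is just a `Map` passed in, and treating unknown words as frequency 0 is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// A sketch of STEP 6: sort a line's tokens by ascending global frequency.
public class FrequencyOrderSketch {

    static List<String> orderByFrequency(List<String> tokens,
                                         Map<String, Integer> freq) {
        List<String> ordered = new ArrayList<>(tokens);
        // Assumption: words missing from the frequency table count as 0,
        // so they sort first.
        ordered.sort((a, b) -> Integer.compare(
                freq.getOrDefault(a, 0), freq.getOrDefault(b, 0)));
        return ordered;
    }
}
```

Sorting rare words first is what makes the later similarity-join prefixes selective, which is the usual motivation for this ordering.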