diff --git a/README.md b/README.md
index e65b96a2cf40d4bd1a77238e1f769f3b959c372d..85a5e27e4532c233cb376815360255d8a3010973 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,7 @@ else if (stopWords.contains(word)){
 word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase())
 ```
 
-**keep each unique word only once per line**
+**Keep each unique word only once per line**
 
 We define a *hashset* where we store words
 
@@ -53,8 +53,16 @@ I used two counters:
 * the other one is to record the number of lines for the output, named *FinalLineNumCounter*, which means the number after removing all empty lines.
 
 The result is shown as below:
+
+NUM = 124787
+
+Final_NUM = 114815
+
+So nearly 10000 lines are empty.
+
+  
 
-<img src="https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counters.PNG" width="100px" height="80px" alt="简书">
+
 
 **Order the tokens of each line in ascending order of global frequency**
 