From 6bb4cfa97d53ed1fa964515b8bcf22e29edf86e9 Mon Sep 17 00:00:00 2001
From: Meiqi Guo <mei-qi.guo@student.ecp.fr>
Date: Fri, 17 Mar 2017 03:04:37 +0100
Subject: [PATCH] Update README.md

---
 README.md | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index e65b96a..85a5e27 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,7 @@ else if (stopWords.contains(word)){
 word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase())
 ```
 
-**keep each unique word only once per line**
+**Keep each unique word only once per line**
 
 We define a *hashset* where we store words
 
@@ -53,8 +53,16 @@ I used two counters:
 * the other one is to record the number of lines for the output, named *FinalLineNumCounter*, which means the number after removing all empty lines.
 
 The result is shown as below:
+
+NUM = 124787
+
+Final_NUM = 114815
+
+So nearly 10000 lines are empty.
+
+  
 
-<img src="https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counters.PNG" width="100px" height="80px" alt="简书">
+
 
 **Order the tokens of each line in ascending order of global frequency**
 
-- 
GitLab