From 30634e04e87c95c5fd309813dea4dd5b1740700e Mon Sep 17 00:00:00 2001
From: Meiqi Guo <mei-qi.guo@student.ecp.fr>
Date: Fri, 17 Mar 2017 02:51:30 +0100
Subject: [PATCH] Update README.md

---
 README.md | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 0ed83fe..b7a7798 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,60 @@
-# Big Data Process Assignment 2
+# Big Data Process Assignment 2 _ Meiqi GUO
 
 ## Pre-processing the input
 
 For the pre-processing part, the input consists of:
 * the document corpus of [pg100.txt](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/pg100.txt)
 * the [Stopword file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/Stopwords) which I made in assignment 1
-* the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt that I obtained by runnning the assignment
-1 with a slight changement of [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java).
+* the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt, which I obtained by running assignment 1 with a slightly modified [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java).
+
+I perform the following tasks in [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java):
+
+**Remove all stopwords**
+```
+else if (stopWords.contains(word)){
+	continue;
+}
+```
+
+**Remove special characters (keep only [a-z], [A-Z] and [0-9]) and convert to lower case**
+```
+word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase());
+```
+
+**Keep each unique word only once per line**
+
+I define a *HashSet* in which the words of each line are stored, so each word is kept only once:
+```
+Set<String> wordSet = new HashSet<String>();
+```
+
+**Remove empty lines**
+
+First, I remove all empty lines of the input:
+```
+if (value.toString().isEmpty()){
+	return;
+}
+```
+After removing stopwords and special characters, I also remove the lines that have become empty:
+```
+if (wordSet.isEmpty()){
+	return;
+}
+```
+
+**Count line numbers**
+
+I use two counters:
+* one records the number of lines in the initial document, named *LineNumCounter*;
+* the other records the number of lines in the output, named *FinalLineNumCounter*, i.e. the number of lines left after all empty lines have been removed.
+
+The result is shown below:
+
+![Counters](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counters.PNG)
+
+**Order the tokens of each line in ascending order of global frequency**
+
+I use the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt, obtained by running assignment 1 with a slightly modified [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java), to order the tokens of each line by their global frequency:
+```
+Collections.sort(wordList, new Comparator<String>() {
+	@Override
+	public int compare(String s1, String s2)
+	{
+		return wordFreq.get(s1).compareTo(wordFreq.get(s2));
+	}
+});
+```
+
+You can see the output file [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess).
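+
+**Putting the steps together**
+
+As a summary, here is a minimal sketch of how the mapper of Preprocess.java can be organised. It is an illustration rather than the exact code: the class name, the output types and the way the Stopwords and wordfreq files are loaded in `setup()` are assumptions; only *LineNumCounter*, *FinalLineNumCounter*, *stopWords*, *wordSet*, *wordList* and *wordFreq* come from the snippets above.
+```
+import java.io.IOException;
+import java.util.*;
+
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Mapper;
+
+public class PreprocessMapperSketch extends Mapper<Object, Text, Text, NullWritable> {
+
+    // Hadoop counters are declared in an enum and incremented through the context.
+    public enum LineCounters { LineNumCounter, FinalLineNumCounter }
+
+    private Set<String> stopWords = new HashSet<String>();
+    private Map<String, Integer> wordFreq = new HashMap<String, Integer>();
+
+    @Override
+    protected void setup(Context context) throws IOException, InterruptedException {
+        // Assumption: the Stopwords and wordfreq files are read here (for example from
+        // the distributed cache) and used to fill stopWords and wordFreq.
+    }
+
+    @Override
+    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
+        context.getCounter(LineCounters.LineNumCounter).increment(1);   // count every input line
+
+        if (value.toString().isEmpty()) {
+            return;                                   // remove empty lines of the input
+        }
+
+        Set<String> wordSet = new HashSet<String>();  // keeps each word once per line
+        StringTokenizer itr = new StringTokenizer(value.toString());
+        while (itr.hasMoreTokens()) {
+            // keep only [a-z], [A-Z], [0-9] and convert to lower case
+            String word = itr.nextToken().replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
+            if (word.isEmpty() || stopWords.contains(word)) {
+                continue;                             // drop emptied tokens and stopwords
+            }
+            wordSet.add(word);
+        }
+
+        if (wordSet.isEmpty()) {
+            return;                                   // remove lines that have become empty
+        }
+
+        // order the tokens of the line by ascending global frequency
+        // (assumes every kept token appears in the word-frequency file)
+        List<String> wordList = new ArrayList<String>(wordSet);
+        Collections.sort(wordList, new Comparator<String>() {
+            @Override
+            public int compare(String s1, String s2) {
+                return wordFreq.get(s1).compareTo(wordFreq.get(s2));
+            }
+        });
+
+        context.getCounter(LineCounters.FinalLineNumCounter).increment(1);  // count output lines
+        context.write(new Text(String.join(" ", wordList)), NullWritable.get());
+    }
+}
+```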
+
+
+All the details can be found in my code [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java).
--
GitLab