# Big Data Process Assignment 2 - Meiqi GUO
## Pre-processing the input

For the pre-processing part, the input consists of:
* the document corpus, [pg100.txt](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/pg100.txt);
* the [Stopword file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/Stopwords), which I built in Assignment 1;
* the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) for pg100.txt, which I obtained by running Assignment 1 with a slightly modified [MyWordCount.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java).

I perform the following tasks in [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java):

**STEP 1: Remove all stopwords**
```
else if (stopWords.contains(word)){
    continue;
}
```

**STEP 2: Remove special characters (keep only [a-z], [A-Z] and [0-9]) and convert to lower case**
```
word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase());
```

**STEP 3: Keep each unique word only once per line**

I store the words of each line in a *HashSet*, so every word is kept at most once per line:
```
Set<String> wordSet = new HashSet<String>();
```

**STEP 4: Remove empty lines**

I first removed all lines that were already empty:
```
if (value.toString().isEmpty()){
    return;
}
```

After removing stopwords and special characters, I removed the lines that had become empty:
```
if (wordSet.isEmpty()){
    return;
}
```

**STEP 5: Count line numbers**

I used two counters:
* *LineNumCounter* records the number of lines in the initial document;
* *FinalLineNumCounter* records the number of lines in the output, i.e. the count after all empty lines have been removed.

The results are shown below:

NUM = 124787

Final_NUM = 114815

So nearly 10,000 lines are empty.

**STEP 6: Order the tokens of each line in ascending order of global frequency**

I used the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt, obtained by running Assignment 1 with a slightly modified [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java), as input, and ordered the tokens of each line by their global frequency:
```
Collections.sort(wordList, new Comparator<String>() {
    @Override
    public int compare(String s1, String s2) {
        return wordFreq.get(s1).compareTo(wordFreq.get(s2));
    }
});
```

You can see the output file [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess).

All the details are in my code, [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java).
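To make Steps 1-5 easier to follow as a whole, here is a minimal sketch of how the snippets quoted above might fit together inside a single Hadoop `map()` method. This is not the real Preprocess.java: the class name `PreprocessMapperSketch`, the counter group `"Preprocess"`, the output types, and the way `stopWords` is loaded are assumptions made for illustration.
```
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a mapper combining Steps 1-5; names and types are assumptions.
public class PreprocessMapperSketch extends Mapper<LongWritable, Text, Text, NullWritable> {

    // In the real job this set would be filled in setup(), e.g. from the stop-word file.
    private final Set<String> stopWords = new HashSet<String>();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // STEP 5: count every input line.
        context.getCounter("Preprocess", "LineNumCounter").increment(1);

        // STEP 4 (first pass): drop lines that are already empty.
        if (value.toString().isEmpty()) {
            return;
        }

        // STEP 3: a HashSet keeps each unique word only once per line.
        Set<String> wordSet = new HashSet<String>();
        for (String token : value.toString().split("\\s+")) {
            // STEP 2: keep only [a-z], [A-Z], [0-9] and convert to lower case.
            String word = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
            // STEP 1: skip stopwords (and tokens that became empty).
            if (word.isEmpty() || stopWords.contains(word)) {
                continue;
            }
            wordSet.add(word);
        }

        // STEP 4 (second pass): drop lines emptied by the cleaning.
        if (wordSet.isEmpty()) {
            return;
        }

        // STEP 5: count the lines that survive pre-processing.
        context.getCounter("Preprocess", "FinalLineNumCounter").increment(1);
        context.write(new Text(String.join(" ", wordSet)), NullWritable.get());
    }
}
```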
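The two reported values (NUM and Final_NUM) come from the counters above. One possible way to read them back on the driver side after the job finishes is sketched below; the class and method names are hypothetical, and only the counter group and names from the mapper sketch are reused.
```
import org.apache.hadoop.mapreduce.Job;

// Hypothetical driver-side helper; assumes the counters declared in the mapper sketch above.
public class CounterReportSketch {

    // Call after job.waitForCompletion(true) has returned.
    static void printLineCounts(Job job) throws Exception {
        long num = job.getCounters()
                .findCounter("Preprocess", "LineNumCounter").getValue();
        long finalNum = job.getCounters()
                .findCounter("Preprocess", "FinalLineNumCounter").getValue();
        System.out.println("NUM = " + num);            // 124787 for pg100.txt
        System.out.println("Final_NUM = " + finalNum); // 114815 for pg100.txt
    }
}
```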
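Finally, a standalone sketch of STEP 6: loading the word-frequency file into a map and sorting the words of one line with the comparator quoted above. The class and method names, the local file path, and the assumed "word whitespace count" line format are assumptions; in the real MapReduce job the map would more likely be built once in `setup()`.
```
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Standalone sketch of STEP 6; names, path and file format are assumptions.
public class FrequencyOrderSketch {

    // Read "word <whitespace> count" lines (word-count output) into a map.
    static Map<String, Integer> loadWordFreq(String path) throws IOException {
        Map<String, Integer> wordFreq = new HashMap<String, Integer>();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                wordFreq.put(parts[0], Integer.parseInt(parts[1]));
            }
        }
        reader.close();
        return wordFreq;
    }

    // STEP 6: order the tokens of one line by ascending global frequency.
    // Assumes every token is present in wordFreq, as in the quoted snippet.
    static void orderByFrequency(List<String> wordList, final Map<String, Integer> wordFreq) {
        Collections.sort(wordList, new Comparator<String>() {
            @Override
            public int compare(String s1, String s2) {
                return wordFreq.get(s1).compareTo(wordFreq.get(s2));
            }
        });
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> wordFreq = loadWordFreq("input/wordfreq"); // hypothetical local path
        List<String> line = new ArrayList<String>(Arrays.asList("noble", "caesar", "speak"));
        orderByFrequency(line, wordFreq);
        System.out.println(line); // rarest word first
    }
}
```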