Commit 5d627b17 authored by Meiqi Guo

Update README.md

parent 5bdc284a
# Big Data Process Assignment 2 _ Meiqi GUO
## Pre-processing the input
For the pre-processing step, the input consists of:
* the document corpus [pg100.txt](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/pg100.txt)
* the [Stopword file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/Stopwords) that I created in Assignment 1
* the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) for pg100.txt, which I obtained by running Assignment 1 with a slightly modified [MyWordCount.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java).

I perform the following tasks in [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java):
**STEP 1: Remove all stopwords**
```
else if (stopWords.contains(word)) {
    continue; // skip stopwords
}
```
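
To see the same check in isolation, here is a minimal sketch that runs outside Hadoop. The class `StopwordFilter`, the method `removeStopwords`, and the three sample stopwords are assumptions made for illustration; they are not taken from Preprocess.java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Preprocess.java.
final class StopwordFilter {

    // Returns the tokens that are not stopwords, in their original order.
    static List<String> removeStopwords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                continue; // same skip as in the excerpt above
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample stopwords only; the real list is read from the Stopwords file.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and", "of"));
        List<String> tokens = Arrays.asList("the", "king", "and", "queen");
        System.out.println(removeStopwords(tokens, stopWords)); // prints [king, queen]
    }
}
```

Keeping the stopwords in a `HashSet` makes each `contains` lookup constant time on average, which matters when the check runs once per token of the corpus.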
**STEP 2: Remove special characters (keep only [a-z],[A-Z] and [0-9]) and convert to lower case**
```
word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase());
```
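
The effect of this expression on a few tokens can be checked with a small standalone sketch. The class `TokenNormalizer` and its method `normalize` are hypothetical names, and passing `Locale.ROOT` to `toLowerCase` is my own addition so that the result does not depend on the machine's default locale.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Preprocess.java.
final class TokenNormalizer {

    // Removes every character outside [A-Za-z0-9], then lower-cases what is left.
    static String normalize(String token) {
        return token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("Hello,"));           // prints hello
        System.out.println(normalize("King's"));           // prints kings
        System.out.println(normalize("--").isEmpty());     // prints true
    }
}
```

A token made only of punctuation becomes the empty string, so the caller still has to discard empty results before storing the word.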
**STEP 3: Keep each unique word only once per line**
I store the words of each line in a *HashSet*, which keeps each word only once:
```
Set<String> wordSet = new HashSet<String>();
```
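
As a rough illustration of how the set removes duplicates within one line, consider the sketch below. The class `LineDeduplicator`, the method `uniqueWords`, and the whitespace-based splitting are assumptions; Preprocess.java may tokenize differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Preprocess.java.
final class LineDeduplicator {

    // Collects the whitespace-separated tokens of a line into a set,
    // so a word that appears several times is stored only once.
    static Set<String> uniqueWords(String line) {
        Set<String> wordSet = new HashSet<String>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                wordSet.add(token);
            }
        }
        return wordSet;
    }

    public static void main(String[] args) {
        // "to" and "be" occur twice but are stored once.
        System.out.println(uniqueWords("to be or not to be").size()); // prints 4
    }
}
```

A `HashSet` does not preserve insertion order, which is acceptable here because the tokens are reordered by frequency in a later step.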
**STEP 4: Remove empty lines**
First, I skip every line that is already empty in the input:
```
if (value.toString().isEmpty()) {
    return; // nothing to emit for an empty input line
}
```
After removing stopwords and special characters, I also skip the lines that have become empty:
```
if (wordSet.isEmpty()) {
    return; // no word survived the cleaning steps
}
```
**STEP 5: Count line numbers**
I used two counters:
* *LineNumCounter* records the number of lines in the initial document;
* *FinalLineNumCounter* records the number of lines in the output, that is, the number of lines left after removing all empty lines.

The result is shown below:
* NUM = 124787
* Final_NUM = 114815

So 9972 lines (nearly 10000) were empty or became empty during pre-processing.
![](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counters.PNG)
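
The counting logic can be sketched without a cluster. In a Hadoop mapper the increments would typically be written as `context.getCounter(...).increment(1)`; the sketch below replaces the Hadoop context with an `EnumMap` so that it runs standalone. The class `LineCounters`, the enum constants `NUM` and `FINAL_NUM`, and the sample lines are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the Hadoop counters; not part of Preprocess.java.
final class LineCounters {

    // Assumed names, mirroring NUM and Final_NUM in the result above.
    enum Counter { NUM, FINAL_NUM }

    static Map<Counter, Long> count(List<String> lines, Set<String> stopWords) {
        Map<Counter, Long> counters = new EnumMap<>(Counter.class);
        counters.put(Counter.NUM, 0L);
        counters.put(Counter.FINAL_NUM, 0L);
        for (String line : lines) {
            counters.merge(Counter.NUM, 1L, Long::sum); // every input line
            Set<String> wordSet = new HashSet<>();
            for (String token : line.split("\\s+")) {
                String word = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
                if (!word.isEmpty() && !stopWords.contains(word)) {
                    wordSet.add(word);
                }
            }
            if (wordSet.isEmpty()) {
                continue; // empty after cleaning: not counted in FINAL_NUM
            }
            counters.merge(Counter.FINAL_NUM, 1L, Long::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("The King!", "", "and the", "To be, or not to be");
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and"));
        System.out.println(count(lines, stopWords)); // prints {NUM=4, FINAL_NUM=2}
    }
}
```

In this toy input, one line is empty from the start and one contains only stopwords, so two of the four lines are dropped. The difference between the two counters plays the same role as the 9972 lines reported above.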
**STEP 6: Order the tokens of each line in ascending order of global frequency**
I used the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) for pg100.txt, which I obtained by running Assignment 1 with a slightly modified [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java), as input and ordered the tokens of each line by their global frequency.
```
Collections.sort(wordList, new Comparator<String>() {
    @Override
    public int compare(String s1, String s2) {
        // ascending order of global frequency
        return wordFreq.get(s1).compareTo(wordFreq.get(s2));
    }
});
```
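
The comparator above relies on every word having an entry in `wordFreq`; `get` returns `null` for a missing key, and the comparison would then throw a `NullPointerException`. A more defensive variant might look like the sketch below. The class `FrequencySorter`, the `Map<String, Integer>` type, the default frequency of 0 for unknown words, and the alphabetical tie-break are all assumptions of mine, not choices documented in Preprocess.java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Preprocess.java.
final class FrequencySorter {

    // Returns a copy of the words sorted by ascending global frequency.
    // Unknown words count as frequency 0; ties are broken alphabetically.
    static List<String> sortByFrequency(List<String> words, Map<String, Integer> wordFreq) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(
            Comparator.comparingInt((String w) -> wordFreq.getOrDefault(w, 0))
                      .thenComparing(Comparator.<String>naturalOrder()));
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up frequencies for illustration only.
        Map<String, Integer> wordFreq = new HashMap<>();
        wordFreq.put("king", 50);
        wordFreq.put("crown", 5);
        wordFreq.put("sword", 5);
        List<String> words = Arrays.asList("king", "sword", "crown", "dagger");
        System.out.println(sortByFrequency(words, wordFreq)); // prints [dagger, crown, sword, king]
    }
}
```

The tie-break makes the output deterministic when two words share the same frequency, so repeated runs produce identical lines.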
You can see the output file [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess).
The full details are in my code, [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java).