# Big Data Process Assignment 2 _ Meiqi GUO
## Pre-processing the input
The pre-processing step takes the following inputs:
* the document corpus [pg100.txt](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/pg100.txt)
* the [Stopword file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/Stopwords) which I created in Assignment 1
* the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt, which I obtained by running Assignment 1 with a slightly modified [MyWordCount.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java).
I do the following tasks in [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java):
**Remove all stopwords**
```
else if (stopWords.contains(word)) {
    continue; // skip stopwords
}
```
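To show the check above in context, here is a minimal self-contained sketch. It is not the code from Preprocess.java: the class name `StopwordFilterSketch`, the method `removeStopwords`, and the whitespace tokenization are assumptions made for illustration, and the stopwords are assumed to be already loaded into a `Set<String>`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper for illustration; not part of Preprocess.java.
class StopwordFilterSketch {

    // Returns the tokens of `line` that are not in `stopWords`, in their original order.
    static List<String> removeStopwords(String line, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line); // splits on whitespace (assumption)
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            if (stopWords.contains(word)) {
                continue; // same check as in the snippet above
            }
            kept.add(word);
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and", "of"));
        // prints [king, queen, denmark]
        System.out.println(removeStopwords("the king and queen of denmark", stopWords));
    }
}
```

A hash-based set keeps each lookup at roughly constant time, which matters because the check runs once per token of the corpus.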
**Remove special characters (keep only [a-z],[A-Z] and [0-9]) and convert to lower case**
```
word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase());
```
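The effect of this regular expression can be checked in isolation. The sketch below uses the same `replaceAll` pattern and `toLowerCase()` call as the line above; the class name `TokenCleanerSketch` and the method `clean` are hypothetical names used only for this example.

```java
// Hypothetical helper for illustration; not part of Preprocess.java.
class TokenCleanerSketch {

    // Drops every character outside [A-Za-z0-9] and lower-cases the rest.
    static String clean(String token) {
        return token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(clean("King's,")); // prints kings
        System.out.println(clean("--").isEmpty()); // prints true
    }
}
```

Note that a token made only of punctuation becomes the empty string, so it presumably has to be skipped before it is stored.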
**Keep each unique word only once per line**
I define a *HashSet* that stores the words of the current line; because a set ignores duplicate insertions, each word is kept only once:
```
Set<String> wordSet = new HashSet<String>();
```
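As a small illustration of how such a set removes duplicates within one line, consider the sketch below. The class `UniqueWordsSketch`, the method `uniqueWords`, and the whitespace split are assumptions for this example, not the code from Preprocess.java.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Preprocess.java.
class UniqueWordsSketch {

    // Collects the distinct whitespace-separated tokens of one line.
    static Set<String> uniqueWords(String line) {
        Set<String> wordSet = new HashSet<String>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                wordSet.add(token); // adding an existing word leaves the set unchanged
            }
        }
        return wordSet;
    }

    public static void main(String[] args) {
        // "to" and "be" occur twice but are stored once, so this prints 4
        System.out.println(uniqueWords("to be or not to be").size());
    }
}
```

A `HashSet` does not preserve insertion order, which is acceptable here because the tokens are reordered by frequency in a later step.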
**Remove empty lines**
First, I skipped all lines that are empty in the input:
```
if (value.toString().isEmpty()) {
    return;
}
```
After removing stopwords and special characters, I also skipped the lines that had become empty:
```
if (wordSet.isEmpty()) {
    return;
}
```
**Count line numbers**
I used two counters:
* one counts the lines of the initial document, named *LineNumCounter*;
* the other counts the lines written to the output, i.e. the lines remaining after all empty lines are removed, named *FinalLineNumCounter*.
The result is shown below:
![](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counters.PNG)
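The interplay between the two empty-line checks and the two counters can be sketched in plain Java, without Hadoop. This is only a simulation under stated assumptions: the class `LineCounterSketch`, the enum type `LineCounter`, and the simplified cleaning step are hypothetical; only the counter names *LineNumCounter* and *FinalLineNumCounter* come from the description above. In a Hadoop mapper, the increment would typically be written as `context.getCounter(...).increment(1)` instead of updating a map.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java simulation; not part of Preprocess.java.
class LineCounterSketch {

    // Enum type name is an assumption; the constant names follow the text.
    enum LineCounter { LineNumCounter, FinalLineNumCounter }

    private final Map<LineCounter, Long> counters = new EnumMap<>(LineCounter.class);

    private void increment(LineCounter counter) {
        counters.merge(counter, 1L, Long::sum);
    }

    long get(LineCounter counter) {
        return counters.getOrDefault(counter, 0L);
    }

    // Counts every input line, then counts only the lines that survive cleaning.
    List<String> process(List<String> lines) {
        List<String> output = new ArrayList<>();
        for (String line : lines) {
            increment(LineCounter.LineNumCounter);
            if (line.isEmpty()) {
                continue; // line empty in the input
            }
            // Simplified cleaning (assumption): keep letters, digits and spaces only.
            String cleaned = line.replaceAll("[^A-Za-z0-9 ]+", "").toLowerCase().trim();
            if (cleaned.isEmpty()) {
                continue; // line became empty after cleaning
            }
            increment(LineCounter.FinalLineNumCounter);
            output.add(cleaned);
        }
        return output;
    }
}
```

For the four input lines `"To be,"`, `""`, `"---"` and `"or not"`, this sketch counts 4 initial lines and 2 final lines.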
**Order the tokens of each line in ascending order of global frequency**
I used the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt, which I obtained by running Assignment 1 with a slightly modified [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java), as input and ordered the tokens of each line by their global frequency.
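One possible way to implement this ordering is sketched below. It assumes the frequencies have already been read from the wordfreq file into a `Map<String, Integer>`; the file format is not shown here, so the loading step is omitted. The class `FrequencyOrderSketch`, the method `orderByFrequency`, the alphabetical tie-break, and the default frequency of 0 for unknown words are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Preprocess.java.
class FrequencyOrderSketch {

    // Sorts words by ascending global frequency; ties are broken alphabetically (assumption).
    static List<String> orderByFrequency(Collection<String> words, Map<String, Integer> wordFreq) {
        List<String> ordered = new ArrayList<>(words);
        ordered.sort(Comparator
                .comparingInt((String w) -> wordFreq.getOrDefault(w, 0)) // unknown words count as 0 (assumption)
                .thenComparing(Comparator.naturalOrder()));
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordFreq = new HashMap<>();
        wordFreq.put("king", 50);   // made-up frequencies for the example
        wordFreq.put("hamlet", 5);
        wordFreq.put("denmark", 5);
        // prints [denmark, hamlet, king]
        System.out.println(orderByFrequency(Arrays.asList("king", "hamlet", "denmark"), wordFreq));
    }
}
```

A deterministic tie-break makes the output reproducible across runs, since a `HashSet` gives no guarantee about the order in which words are iterated.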
......