Commit b93032f9 authored by Meiqi Guo

Update README.md

parent 6bb4cfa9
@@ -10,19 +10,19 @@ For the part of pre-procesing, the input consists of:
I do the following tasks in [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java):
**Remove all stopwords**
**STEP 1: Remove all stopwords**
```
else if (stopWords.contains(word)){
    continue;
}
```
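As a rough illustration of this step (a standalone sketch, not the code in Preprocess.java), the stopword test can be applied to every token of a line as follows. The class name `StopWordFilterSketch`, the method `removeStopWords`, and the whitespace-based tokenization are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, used only for illustration.
class StopWordFilterSketch {

    // Returns the tokens of a line that are not in the stopword set.
    // Assumption: tokens are separated by whitespace.
    static List<String> removeStopWords(String line, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the line itself is blank
            } else if (stopWords.contains(word)) {
                continue; // same test as in the snippet above
            }
            kept.add(word);
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and", "of"));
        // prints [tragedy, hamlet]
        System.out.println(removeStopWords("the tragedy of hamlet", stopWords));
    }
}
```

A `HashSet` keeps the `contains` check at constant expected time, which matters when it runs once per token.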
**Remove special characters (keep only [a-z],[A-Z] and [0-9]) and convert to lower case**
**STEP 2: Remove special characters (keep only [a-z],[A-Z] and [0-9]) and convert to lower case**
```
word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase());
```
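To show what this expression does to individual tokens, here is a small standalone sketch. The class `TokenNormalizerSketch` and the method `normalize` are hypothetical names. The regular expression is the one used above.

```java
// Hypothetical helper class, used only for illustration.
class TokenNormalizerSketch {

    // Drops every character outside [A-Za-z0-9], then lower-cases the rest.
    static String normalize(String token) {
        return token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Hamlet's,"));    // prints hamlets
        System.out.println(normalize("--").isEmpty()); // prints true
    }
}
```

A token made only of special characters becomes the empty string, so it has to be skipped before it is stored.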
**Keep each unique word only once per line**
**STEP 3: Keep each unique word only once per line**
We define a *hashset* where we store words
@@ -30,7 +30,7 @@ We define a *hashset* where we store words
Set<String> wordSet = new HashSet<String>();
```
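A minimal sketch of how such a set removes duplicates within one line, assuming whitespace tokenization. The class `UniqueWordsSketch` and the method `uniqueWords` are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, used only for illustration.
class UniqueWordsSketch {

    // Collects the words of one line into a set, so each word is kept once.
    static Set<String> uniqueWords(String line) {
        Set<String> wordSet = new HashSet<String>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                wordSet.add(word); // add() leaves the set unchanged for a repeated word
            }
        }
        return wordSet;
    }

    public static void main(String[] args) {
        // "to" and "be" appear twice but are stored once: prints 4
        System.out.println(uniqueWords("to be or not to be").size());
    }
}
```

A `HashSet` does not preserve insertion order, which is acceptable here if the tokens are reordered afterwards anyway.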
**Remove empty lines**
**STEP 4: Remove empty lines**
I first removed all empty lines:
@@ -45,7 +45,7 @@ After removing stopwords and special characters, I removed all new empty lines:
if (wordSet.isEmpty()){
    return;
}
```
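The two checks can be sketched outside of MapReduce as follows. This is an illustration under assumptions, not the code of Preprocess.java: the class `EmptyLineFilterSketch`, the methods `cleanLine` and `keepNonEmpty`, and the choice to test stopwords after normalization are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, used only for illustration.
class EmptyLineFilterSketch {

    // Cleans one line; an empty result means the line should be dropped.
    static Set<String> cleanLine(String line, Set<String> stopWords) {
        Set<String> wordSet = new HashSet<>();
        if (line.trim().isEmpty()) {
            return wordSet; // first check: the line is empty before cleaning
        }
        for (String token : line.trim().split("\\s+")) {
            String word = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
            if (word.isEmpty() || stopWords.contains(word)) {
                continue;
            }
            wordSet.add(word);
        }
        return wordSet;
    }

    // Keeps only the lines that still contain at least one word.
    static List<Set<String>> keepNonEmpty(List<String> lines, Set<String> stopWords) {
        List<Set<String>> kept = new ArrayList<>();
        for (String line : lines) {
            Set<String> wordSet = cleanLine(line, stopWords);
            if (wordSet.isEmpty()) {
                continue; // second check: the line became empty after cleaning
            }
            kept.add(wordSet);
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        List<String> lines = Arrays.asList("", "The --", "Hamlet, prince");
        System.out.println(keepNonEmpty(lines, stopWords).size()); // prints 1
    }
}
```

In the example, the first line is empty from the start and the second becomes empty after cleaning, so only the third survives.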
**Count line numbers**
**STEP 5: Count line numbers**
I used two counters:
@@ -64,7 +64,7 @@ So nearly 10000 lines are empty.
![](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counters.PNG)
**Order the tokens of each line in ascending order of global frequency**
**STEP 6: Order the tokens of each line in ascending order of global frequency**
I used the [Words with frequency file](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/wordfreq) of pg100.txt, which I obtained by running Assignment 1 with a slight change to [MyWordCount](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment1/blob/master/MyWordCount.java), as input and ordered the tokens by their frequency.
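As a minimal sketch of this ordering in plain Java, assume the frequency file has already been loaded into a `Map<String, Integer>`. The class `FrequencyOrderSketch`, the method `sortByFrequency`, the alphabetical tie-break, and the default frequency of 0 for unknown words are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, used only for illustration.
class FrequencyOrderSketch {

    // Sorts words by ascending global frequency; ties are broken alphabetically
    // so that the output is deterministic (an assumption of this sketch).
    static List<String> sortByFrequency(Collection<String> words, Map<String, Integer> freq) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Comparator
                .comparingInt((String w) -> freq.getOrDefault(w, 0))
                .thenComparing(Comparator.naturalOrder()));
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up frequencies, for the example only.
        Map<String, Integer> freq = new HashMap<>();
        freq.put("hamlet", 5);
        freq.put("prince", 2);
        freq.put("denmark", 2);
        // prints [denmark, prince, hamlet]
        System.out.println(sortByFrequency(Arrays.asList("hamlet", "prince", "denmark"), freq));
    }
}
```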
```
......