Commit 78accfce authored by Wen Yao Jin

Merge branch 'master' of gitlab.my.ecp.fr:2014jinwy/BDPA_Assign2_WJIN

parents 64edcd4d ad38ee53
# Assignment 2 for BDPA
### by Wenyao JIN
---
### Preprocessing the input
#### 1. Remake the stopwords file
By slightly modifying the wordcount code from the previous assignment, we can output a stopwords file.
* take all three input files as before
* split each line into tokens on whitespace or "--"
* remove all characters other than letters and digits
* convert all words to lower case
* output only the words whose count is larger than 4000
```java
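public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Split on whitespace or on runs of two or more hyphens.
    for (String token : value.toString().split("\\s+|-{2,}+")) {
        // Keep letters and digits only, then normalize to lower case.
        String cleaned = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
        if (cleaned.isEmpty()) {
            continue; // skip tokens that contained only punctuation
        }
        word.set(cleaned);
        context.write(word, ONE);
    }
}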
```
The stop word file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/stopwords).
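
The mapper above covers the tokenizing and cleaning steps, but the count threshold from the last bullet is not shown; it is presumably applied on the reduce side. As a rough illustration only, the whole step can be sketched in plain Java without Hadoop. The class name `StopwordSketch`, its method names, and the sample sentences are hypothetical, and the demo uses a small threshold in place of 4000.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-alone sketch; not part of the assignment code.
class StopwordSketch {

    // Count cleaned, lower-cased tokens across all lines.
    static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // Same tokenizer idea as the mapper: whitespace or "--".
            for (String token : line.split("\\s+|-{2,}")) {
                String w = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
                if (w.isEmpty()) {
                    continue;
                }
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Keep only the words whose count is strictly larger than the threshold.
    static Set<String> stopwords(Map<String, Integer> counts, int threshold) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input; the real job reads the three input files.
        List<String> lines = Arrays.asList("The cat--the dog", "THE end, the END.");
        Map<String, Integer> counts = countWords(lines);
        // The assignment uses 4000; 2 is used here so the toy input yields a result.
        System.out.println(stopwords(counts, 2));
    }
}
```

In this toy input only "the" occurs more than twice, so it is the only word kept.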
#### 2. Count word frequency of pg100.txt
Reusing the wordcount algorithm, we recount the word frequencies of pg100.txt so that they can be used later for sorting words. This time letter case is preserved so that it is taken into account in the similarity comparison. The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/wordfreq).
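
To use these counts later, the frequency file has to be read back into a lookup table. The sketch below is one possible way to do that, assuming each line has the form `word<TAB>count` (the default layout of Hadoop's text output); the actual layout of the `wordfreq` file is not confirmed here. The class name `WordFreqLoader` and the sample counts are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; assumes tab-separated "word<TAB>count" lines.
class WordFreqLoader {

    // Parse the lines into a case-sensitive word -> count table.
    static Map<String, Integer> parse(Iterable<String> lines) {
        Map<String, Integer> freq = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue; // ignore empty or malformed lines
            }
            freq.put(parts[0], Integer.parseInt(parts[1].trim()));
        }
        return freq;
    }

    public static void main(String[] args) {
        // Made-up counts, only to show that "The" and "the" stay distinct.
        Map<String, Integer> freq = parse(Arrays.asList("The\t12", "the\t340", "Romeo\t7"));
        System.out.println(freq.get("The") + " " + freq.get("the"));
    }
}
```

Because the keys are stored as they appear in the file, "The" and "the" keep separate counts, which matches the decision to preserve case in this step.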
#### 3. Output lines
In this step, several tasks should be done:
* Store all stopwords in a set
* Store all word frequencies in a hash map
* For each line:
    * keep track of the line number with a counter
    * skip empty lines
    * separate the words
    * filter out special characters
    * remove words that are stopwords
    * remove duplicates
    * sort the words by their pre-calculated frequency
    * output the words with their line number as key
For this step, all tasks are done within the mapper. The tokenizer splits on whitespace or "--" as before. A set container is used to avoid duplicates. Java's built-in sort function is applied with a custom compare function that uses the word frequency, so the words are ordered from least to most frequent. StringUtils's join function serves to join the words together with a space.
```java
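public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Count every line, including empty ones, so the counter follows the line number.
    // Note: counter values are per task, so this matches the line number in the
    // file only when a single mapper reads the whole input.
    Counter counter = context.getCounter(DocLineCounter.NUM);
    counter.increment(1);
    if (value.toString().isEmpty()) {
        return;
    }
    // A set removes duplicate words within the line.
    Set<String> wordSet = new HashSet<String>();
    for (String token : value.toString().split("\\s+|-{2,}+")) {
        String s = token.replaceAll("[^A-Za-z0-9]+", "");
        if (s.isEmpty() || stopWords.contains(s)) {
            continue;
        }
        if (!wordFreq.containsKey(s)) {
            // Skip the word: the comparator below would otherwise throw a
            // NullPointerException for a word without a frequency.
            System.out.println("WARN: HASHTABLE DOESN'T HAVE WORD:");
            System.out.println(s);
            continue;
        }
        wordSet.add(s);
    }
    // Sort the remaining words by ascending global frequency.
    List<String> wordList = new ArrayList<String>(wordSet);
    Collections.sort(wordList, new Comparator<String>() {
        @Override
        public int compare(String s1, String s2) {
            return wordFreq.get(s1).compareTo(wordFreq.get(s2));
        }
    });
    words.set(StringUtils.join(wordList, " "));
    context.write(new LongWritable(counter.getValue()), words);
}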
```
The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/sortedline).
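
To make the effect of this step concrete, here is a small worked example in plain Java, independent of Hadoop. It is only a sketch: the class name `SortedLineSketch`, the sample line, the stopword set, and the frequency values are all made up. It also differs from the mapper in two labeled ways: it uses `String.join` instead of `StringUtils.join`, and it breaks frequency ties alphabetically so that the output is deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone sketch of the per-line processing.
class SortedLineSketch {

    static String process(String line, Set<String> stopWords, Map<String, Integer> wordFreq) {
        Set<String> unique = new HashSet<>();
        for (String token : line.split("\\s+|-{2,}")) {
            String s = token.replaceAll("[^A-Za-z0-9]+", "");
            // Drop empty tokens, stopwords, and words without a known frequency.
            if (s.isEmpty() || stopWords.contains(s) || !wordFreq.containsKey(s)) {
                continue;
            }
            unique.add(s);
        }
        List<String> words = new ArrayList<>(unique);
        // Least frequent first; the alphabetical tie-break is an addition of this sketch.
        words.sort((a, b) -> {
            int byFreq = wordFreq.get(a).compareTo(wordFreq.get(b));
            return byFreq != 0 ? byFreq : a.compareTo(b);
        });
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and"));
        Map<String, Integer> wordFreq = new HashMap<>();
        wordFreq.put("Romeo", 7);   // made-up frequencies
        wordFreq.put("loves", 20);
        wordFreq.put("Juliet", 5);
        wordFreq.put("dearly", 2);
        String line = "Romeo loves Juliet--and Juliet loves the Romeo dearly!";
        System.out.println(process(line, stopWords, wordFreq));
        // prints "dearly Juliet Romeo loves"
    }
}
```

The duplicates and stopwords disappear, punctuation is stripped, and the rarest word comes first, which is the ordering that the later similarity comparison relies on.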