Commit 6c4307b5 authored by Wen Yao Jin

Update BDPA_Assign2_WJIN.md

parent 78accfce
# Assignment 2 for BDPA
##### by Wenyao JIN
---
### Preprocessing the input
#### 1. Remake the stopwords file
@@ -40,7 +40,7 @@ In this step, several tasks should be done:
* sort them by their pre-calculated frequency
* output words with their line number as key

For this step, all tasks are done within the mapper. As before, tokens are split on " " or "--". A set container is used to avoid duplicates. Java's built-in sort function is applied with a custom compare function that uses the word frequency. StringUtils's join function then joins the words together with a comma. The counter reveals a total of `124787` lines.
```java
public void map(LongWritable key, Text value, Context context)
// ... lines elided by the diff (hunk @@ -70,9 +70,16 @@) ...
            // Order two words by their pre-calculated frequency.
            return wordFreq.get(s1).compareTo(wordFreq.get(s2));
        }
    });
    // Join the sorted words with commas and emit them keyed by the counter value.
    words.set(StringUtils.join(wordList, ","));
    context.write(new LongWritable(counter.getValue()), words);
}
}
```
The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/sortedline). The total number of lines is written to HDFS as instructed; that file is also available [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/linenumber).
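
Since the diff only shows the end of the mapper, here is a minimal stand-alone sketch of the per-line logic described above (split on " " or "--", deduplicate with a set, sort by frequency, join with a comma). It uses plain Java without Hadoop, and it is not the author's code: the class `SortedLineSketch`, the method `sortByFrequency`, and the treatment of unknown words as frequency 0 are all assumptions made for illustration. Any step hidden by the diff is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; not part of the assignment's code.
final class SortedLineSketch {

    // Returns the distinct words of a line, ordered by ascending frequency
    // and joined with commas.
    static String sortByFrequency(String line, Map<String, Integer> wordFreq) {
        // A set drops duplicates; LinkedHashSet keeps first-seen order.
        Set<String> unique = new LinkedHashSet<>();
        for (String token : line.split(" |--")) {
            if (!token.isEmpty()) {
                unique.add(token);
            }
        }
        List<String> wordList = new ArrayList<>(unique);
        // Assumption: words missing from the table count as frequency 0.
        wordList.sort(Comparator.comparingInt(
                (String w) -> wordFreq.getOrDefault(w, 0)));
        return String.join(",", wordList);
    }

    public static void main(String[] args) {
        Map<String, Integer> wordFreq = Map.of("a", 3, "b", 1, "c", 2);
        System.out.println(sortByFrequency("a b--c a", wordFreq)); // prints b,c,a
    }
}
```

Because `List.sort` is stable, words with equal frequency keep the order in which they first appeared in the line. Whether the original mapper breaks ties the same way is not visible in the diff.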
---
### Set similarity joins