From ff3b2c9534060deccd499c88bd1d82df0217fc95 Mon Sep 17 00:00:00 2001
From: Wen Yao Jin <wen-yao.jin@student.ecp.fr>
Date: Sat, 11 Mar 2017 00:12:46 +0100
Subject: [PATCH] Update BDPA_Assign2_WJIN.md

---
 BDPA_Assign2_WJIN.md | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/BDPA_Assign2_WJIN.md b/BDPA_Assign2_WJIN.md
index 96c76c1..53f2ab3 100644
--- a/BDPA_Assign2_WJIN.md
+++ b/BDPA_Assign2_WJIN.md
@@ -1 +1,29 @@
-# Assignment
+# Assignment 2 for BDPA
+### by Wenyao JIN
+---
+### Preprocessing the input
+#### 1. Remake the stopwords file
+By slightly modifying the wordcount code from the previous assignment, we can output a stopwords file.
+* take all three input files as before
+* use whitespace or "--" as the token separator
+* filter out all characters besides letters and numbers
+* transform all words to lower case
+* output only the words whose count is larger than 4000
+
+```java
+
+    public void map(LongWritable key, Text value, Context context)
+            throws IOException, InterruptedException {
+        for (String token: value.toString().split("\\s+|-{2,}+")) {
+            word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase());
+            context.write(word, ONE);
+        }
+    }
+
+```
+The stopwords file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/stopwords).
+
+#### 2. Count word frequency of pg100.txt
+Using the wordcount algorithm again, we recount the word frequencies for pg100.txt, to be used later for word sorting. This time capitalization is kept so that it is taken into account in the similarity comparison. The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/wordfreq).
+
+#### 3. Output sorted lines
\ No newline at end of file
-- 
GitLab
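
Note on step 1: the patch shows only the mapper, while the filter on counts larger than 4000 described in the bullet list would presumably sit in the reducer, which is not included. As a rough sketch of the whole step, the plain-Java class below mimics the map and reduce phases in memory so the logic can be checked without Hadoop. The class name `StopwordSketch`, its method names, the `threshold` parameter, and the skipping of empty tokens are assumptions made for illustration, not the author's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the patch: an in-memory stand-in for the
// map and reduce phases of the stopwords job.
final class StopwordSketch {

    // "Map" side: same normalization as the mapper in the patch. Split on
    // whitespace or runs of two or more dashes, strip everything that is not
    // a letter or digit, then lower-case.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String raw : line.split("\\s+|-{2,}")) {
            String token = raw.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
            if (!token.isEmpty()) { // assumption: drop tokens that normalize to ""
                tokens.add(token);
            }
        }
        return tokens;
    }

    // "Reduce" side, part 1: sum the occurrences of each token.
    static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : tokenize(line)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // "Reduce" side, part 2: keep only words whose count is strictly larger
    // than the threshold (4000 in the assignment text).
    static List<String> stopwords(Map<String, Integer> counts, int threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > threshold) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

For example, `StopwordSketch.stopwords(StopwordSketch.countWords(lines), 4000)` would return, in alphabetical order, the words that occur more than 4000 times in `lines`. In the actual Hadoop job the same comparison would most likely guard the `context.write` call in the reducer.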