BDPA_Assign2_WJIN.md

Wen Yao Jin authored Mar 10, 2017

# Assignment 2 for BDPA

by Wenyao JIN

Preprocessing the input

1. Remake the stopwords file

By slightly modifying the wordcount code from the previous assignment, we can output a stopwords file.

- take all three input files as before
- use whitespace or "--" as the token separator
- filter out all characters other than letters and digits
- transform all words to lower case
- output only words whose count is larger than 4000


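      public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
         // Split each line on whitespace or on runs of two or more hyphens.
         for (String token : value.toString().split("\\s+|-{2,}+")) {
            // Keep only letters and digits, then lowercase the result.
            String cleaned = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
            if (cleaned.isEmpty()) {
               continue; // skip tokens that contained no letters or digits
            }
            word.set(cleaned);
            context.write(word, ONE);
         }
      }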

The stop word file can be found here.
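
The mapper above covers only the tokenizing and cleaning steps; the count threshold would be applied after the counts are summed. As a rough illustration of the whole pipeline, here is a minimal plain-Java sketch that runs without Hadoop. The class name `StopwordSketch`, the method names, and the `threshold` parameter are hypothetical and are not taken from the assignment code.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop) of the stopword pipeline described above.
// Class and method names are hypothetical, not taken from the assignment code.
class StopwordSketch {

    // Mirrors the mapper: split on whitespace or "--", strip everything
    // except letters and digits, lowercase, and count each cleaned word.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split("\\s+|-{2,}")) {
                String cleaned = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
                if (!cleaned.isEmpty()) {
                    counts.merge(cleaned, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Mirrors the reduce-side filter: keep only words whose count is
    // strictly larger than the given threshold.
    static Map<String, Integer> stopwords(Map<String, Integer> counts, int threshold) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```

Calling `StopwordSketch.stopwords(StopwordSketch.countWords(lines), 4000)` on the lines of the three input files would correspond to the threshold used in this assignment.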

2. Count word frequency of pg100.txt

Reusing the wordcount algorithm, we recount the word frequencies for pg100.txt, to be used later for word sorting. This time capitalization is preserved so that it is taken into account in the similarity comparison. The output file can be found here.
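
To illustrate the difference from the stopword step, here is a minimal plain-Java sketch (no Hadoop) of a case-preserving count. It assumes the tokenization is otherwise unchanged, which the text does not state. The `sortByFrequency` helper is likewise an assumption about how the counts might be used for word sorting: ascending frequency, with alphabetical tie-breaking. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (no Hadoop); class and method names are hypothetical.
class FrequencySketch {

    // Assumed to use the same tokenization as the stopword step, but without
    // toLowerCase(), so "The" and "the" are counted as different words.
    static Map<String, Integer> countCaseSensitive(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String token : line.split("\\s+|-{2,}")) {
                String cleaned = token.replaceAll("[^A-Za-z0-9]+", "");
                if (!cleaned.isEmpty()) {
                    counts.merge(cleaned, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Assumed use of the counts: order words by ascending frequency,
    // breaking ties alphabetically so the result is deterministic.
    static List<String> sortByFrequency(List<String> words, Map<String, Integer> counts) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Comparator
                .comparingInt((String w) -> counts.getOrDefault(w, 0))
                .thenComparing(Comparator.<String>naturalOrder()));
        return sorted;
    }
}
```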

3. Output sorted lines