BDPA_Assign2_WJIN.md

    # Assignment 2 for BDPA

    by Wenyao JIN


    Preprocessing the input

    1. Remake the stopwords file

    By slightly modifying the wordcount code from the previous assignment, we can output a stopwords file.

    • take all three input files as before
    • split each line on whitespace or "--"
    • strip every character other than letters and digits
    • convert all words to lower case
    • output only the words whose count is larger than 4000
    
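          // Emit (word, 1) for every token; split on whitespace or runs of two or more hyphens.
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              for (String token : value.toString().split("\\s+|-{2,}+")) {
                  // Keep letters and digits only, then lower-case.
                  String cleaned = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
                  if (cleaned.isEmpty()) continue; // skip tokens that were pure punctuation
                  word.set(cleaned);
                  context.write(word, ONE);
              }
          }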
          

    The stop word file can be found here.
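
The steps listed above can be sketched end to end in plain Java, without Hadoop types, so the logic can be run standalone. This is only an illustration: the class name `StopwordSketch`, the method names, and the in-memory map standing in for the shuffle and reduce phase are assumptions, not the assignment's actual job code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical standalone sketch of the stopword pipeline (not the Hadoop job itself).
class StopwordSketch {
    // Same rules as the mapper: split on whitespace or "--", strip
    // non-alphanumeric characters, lower-case, and skip empty results.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split("\\s+|-{2,}")) {
            String cleaned = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
            if (!cleaned.isEmpty()) {
                words.add(cleaned);
            }
        }
        return words;
    }

    // Counts every word across all lines, then keeps only the words whose
    // count is strictly larger than the threshold (4000 in the assignment).
    static Map<String, Integer> stopwords(List<String> lines, int threshold) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String w : tokenize(line)) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        counts.values().removeIf(c -> c <= threshold);
        return counts;
    }
}
```

Calling `StopwordSketch.stopwords(lines, 4000)` on the lines of the three input files would correspond to the filter described above; in the real job the thresholding would presumably sit in the reducer.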

    2. Count word frequency of pg100.txt

    Reusing the wordcount algorithm, we recount the word frequencies for pg100.txt, to be used later for word sorting. This time the original capitalization is kept, so that case is taken into account in the similarity comparison. The output file can be found here.
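
To illustrate what keeping the capitalization changes, here is a small plain-Java sketch of a case-sensitive count. It assumes the same splitting and character filter as in step 1, which the report does not state explicitly; the class `CaseSensitiveCount` and its method are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: word frequencies with capitalization preserved.
class CaseSensitiveCount {
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String line : lines) {
            // Assumed: same tokenizer as step 1, but without toLowerCase().
            for (String token : line.split("\\s+|-{2,}")) {
                String word = token.replaceAll("[^A-Za-z0-9]+", "");
                if (!word.isEmpty()) {
                    freq.merge(word, 1, Integer::sum);
                }
            }
        }
        return freq;
    }
}
```

With this variant, "The", "the" and "THE" are counted as three distinct words, whereas the stopword count in step 1 merges them.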

    3. Output sorted lines