diff --git a/BDPA_Assign2_WJIN.md b/BDPA_Assign2_WJIN.md
index 8c9c8cb3139d6e8a74246cbb56f777d981bf953e..e7ada0e55146b7166439e3dd51a29d7a21e87270 100644
--- a/BDPA_Assign2_WJIN.md
+++ b/BDPA_Assign2_WJIN.md
@@ -1,5 +1,5 @@
 # Assignment 2 for BDPA
-### by Wenyao JIN
+##### by Wenyao JIN
 ---
 ### Preprocessing the input
 #### 1. Remake the stopwords file
@@ -40,7 +40,7 @@ In this step, several tasks should be done:
 * sort them by their pre-calculated frequency
 * output words with their line number as key
 
-For this step, all task are done within the mapper. The tokenizer is " " or "--" as before. A set container is used to avoid duplicates. Java's build-in sort function is applied with a costumed compare function incorporating the word frequency. StringUtils's join function serves to join words together with a space.
+For this step, all task are done within the mapper. The tokenizer is " " or "--" as before. A set container is used to avoid duplicates. Java's build-in sort function is applied with a costumed compare function incorporating the word frequency. StringUtils's join function serves to join words together with a comma. The counter reveals a total of ```124787``` lines.
 
 ```java
 public void map(LongWritable key, Text value, Context context)
@@ -70,9 +70,16 @@ For this step, all task are done within the mapper. The tokenizer is " " or "--"
                return wordFreq.get(s1).compareTo(wordFreq.get(s2));
             }
          });
          words.set(StringUtils.join(wordList," "));
+         words.set(StringUtils.join(wordList,","));
          context.write(new LongWritable(counter.getValue()), words);
       }
-   }
 ```
-The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/sortedline).
+The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/sortedline). The total line number is output to HDFS as instructed, you can also find the file [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/linenumber).
+
+---
+### Set similarity joins
+
+
+
+
+