From 6c4307b5ed63566aa952fe87b3dbf2335f6395f2 Mon Sep 17 00:00:00 2001
From: Wen Yao Jin <wen-yao.jin@student.ecp.fr>
Date: Sat, 11 Mar 2017 14:19:13 +0100
Subject: [PATCH] Update BDPA_Assign2_WJIN.md

---
 BDPA_Assign2_WJIN.md | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/BDPA_Assign2_WJIN.md b/BDPA_Assign2_WJIN.md
index 8c9c8cb..e7ada0e 100644
--- a/BDPA_Assign2_WJIN.md
+++ b/BDPA_Assign2_WJIN.md
@@ -1,5 +1,5 @@
 # Assignment 2 for BDPA
-### by Wenyao JIN
+##### by Wenyao JIN
 ---
 ### Preprocessing the input
 #### 1. Remake the stopwords file
@@ -40,7 +40,7 @@ In this step, several tasks should be done:
 * sort them by their pre-calculated frequency
 * output words with their line number as key
 
-For this step, all task are done within the mapper. The tokenizer is " " or "--" as before. A set container is used to avoid duplicates. Java's build-in sort function is applied with a costumed compare function incorporating the word frequency. StringUtils's join function serves to join words together with a space.
+For this step, all task are done within the mapper. The tokenizer is " " or "--" as before. A set container is used to avoid duplicates. Java's build-in sort function is applied with a costumed compare function incorporating the word frequency. StringUtils's join function serves to join words together with a comma. The counter reveals a total of ```124787``` lines.
 ```java
      public void map(LongWritable key, Text value, Context context)
@@ -70,9 +70,16 @@ For this step, all task are done within the mapper. The tokenizer is " " or "--" as before. A set container is used to avoid duplicates. Java's build-in sort function is applied with a costumed compare function incorporating the word frequency. StringUtils's join function serves to join words together with a space.
                  return wordFreq.get(s1).compareTo(wordFreq.get(s2));
              }
          });
-         words.set(StringUtils.join(wordList," "));
+         words.set(StringUtils.join(wordList,","));
          context.write(new LongWritable(counter.getValue()), words);
      }
-   }
 ```
-The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/sortedline).
+The output file can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/sortedline). The total line number is output to HDFS as instructed, you can also find the file [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/linenumber).
+
+---
+### Set similarity joins
+
+
+
+
+
-- 
GitLab