diff --git a/Report.md b/Report.md
index ba15e43a915fe8f2c548f077cefbb15e8b4e9bcd..d34bb8ddbe4b1186204c0a9fe19899e8f283754b 100644
--- a/Report.md
+++ b/Report.md
@@ -204,7 +204,7 @@ class TextPair implements WritableComparable<TextPair> {
 
 To avoid redundant parses of the input file, some intuition is needed. In this algorithm, the input file is parsed only once.
 * Load in advance all document ids (in our case we use the total line number n, obtained from the line counter during preprocessing);
-* We use [output_preprocess_sample](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess_sample) as input and for each document from the input, emit `(id,i)`, where `id` is the document's id and `i` is for all in `1:n` and `i!=id`. Note that some documents may be empty, I will treat them in the Reduce phrase;
+* We use [output_preprocess_sample](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess_sample) as input and, for each document, emit `(id,i)` for every `i` in `1:n` with `i != id`, where `id` is the document's id. Note that some documents may be empty; I will handle them during the Reduce phase.
 
 ```
 public void map(Text key, Text value, Context context)
@@ -241,6 +241,7 @@
 }
 ```
 **STEP 3: Reduce phase**
+
 In the reduce phase,
 * Firstly, process only keys that involve no empty documents.
 * Secondly, compare the two documents of each key, counting the comparisons with a counter.
@@ -292,13 +293,85 @@ public void reduce(TextPair key, Iterable<Text> values, Context context)
 }
 ```
-[The excution](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/hadoop_naive.PNG) is `4min, 15seconds`.
-[Comparaison times](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counter_Naive.PNG) is 11476. Knowing that we have [n=152 documents](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/LineNum_sample.PNG) in our sampled file, we find 11476=152*(152-1)/2. So, the algorithm worked as expected.
+[The execution time](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/hadoop_naive.PNG) is `4min 15s`.
+
+[The comparison count](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counter_Naive.PNG) is 11476. Since we have [n=152 documents](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/LineNum_sample.PNG) in our sampled file, we find 11476 = 152*(152-1)/2, so the algorithm performed exactly the expected number of comparisons.
 
 You can find the Hadoop overview below:
 
 
 
+See the complete code [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/NaiveApproach.java). I didn't commit the output since it is empty for the sample.
+
+### Index Approach
+
+In this method, I need to create an inverted index, restricted to the first |d| - ⌈t|d|⌉ + 1 words of each document d (prefix filtering with threshold t = 0.8). The reducer then computes the similarity of the candidate document pairs.
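+
+For example, with t = 0.8 and a document of |d| = 10 words, only the first 10 - ⌈0.8 × 10⌉ + 1 = 3 words are indexed. The intuition behind prefix filtering is that, with the words of every document kept in one fixed global order, two documents that are at least 0.8-similar must share at least one word within these short prefixes, so every qualifying pair still meets under some key.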
+
+**STEP 1: Map phase**
+
+* compute the prefix length `int filter`
+* emit each of the first `filter` words as `key` and the document id as `value`
+
+```
+public void map(Text key, Text value, Context context)
+        throws IOException, InterruptedException {
+
+    // skip empty documents
+    if (value.toString().isEmpty()){
+        return;
+    }
+
+    String[] document = value.toString().split(",");
+    // index only the first |d| - ceil(0.8*|d|) + 1 words (prefix filtering, t = 0.8)
+    int filter = document.length - (int)Math.ceil(document.length*0.8) + 1;
+    for (int counter = 0; counter < filter; counter++){
+        word.set(document[counter]);
+        context.write(word, key);
+    }
+}
+```
+**STEP 2: Reduce phase**
+
+* we load the adapted pre-processed file into a HashMap object (see the sketch at the end of this section)
+* for each key, we form all possible pairs of documents from the key's posting list
+* we compute the similarity of the two documents in each pair, using the same `similarity` function as before, and count the comparisons
+```
+public void reduce(Text key, Iterable<Text> values, Context context)
+        throws IOException, InterruptedException {
+    // copy the ids, since Hadoop reuses the Text instance across iterations
+    List<Text> val = new ArrayList<Text>();
+    for (Text v : values){
+        val.add(new Text(v));
+    }
+    for (int i = 0; i < val.size(); i++){
+        Text v1 = val.get(i);
+        for (int j = i + 1; j < val.size(); j++){
+            Text v2 = val.get(j);
+            // skip degenerate pairs made of the same document id
+            if (v1.equals(v2)){
+                continue;
+            }
+            String s1 = this.document.get(v1).toString();
+            String s2 = this.document.get(v2).toString();
+            context.getCounter(CompCounter2.NUM).increment(1);
+            Double s = similarity(s1, s2);
+
+            if (s >= 0.8){
+                context.write(new Text(v1.toString() + ',' + v2.toString()), new Text(String.valueOf(s)));
+            }
+        }
+    }
+}
+```
+[The execution time](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Hadoop_IndexApproach.PNG) is `42s`, much less than the Naive Approach's.
+
+[The comparison count](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counter_IndexApproach.PNG) is 17, much less than the Naive Approach's 11476.
+
+You can find the Hadoop overview below:
+
+
+
+See the complete code [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/IndexApproach.java). I didn't commit the output since it is empty for the sample.
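+
+For reference, the reduce phase above relies on a `document` map from document id to word list, filled before any `reduce()` call. The snippet below is only a minimal sketch of how this could be done in the reducer's `setup()`, not the committed implementation: the local file name `output_preprocess_sample` and the `id<TAB>w1,w2,...` line layout are assumptions (in practice the file could be shipped to each node, e.g. via the distributed cache).
+
+```
+// assumed imports: java.io.BufferedReader, java.io.FileReader, java.io.IOException,
+// java.util.HashMap, org.apache.hadoop.io.Text
+private HashMap<Text, Text> document = new HashMap<Text, Text>();
+
+@Override
+protected void setup(Context context)
+        throws IOException, InterruptedException {
+    // hypothetical layout: each line holds a document id, a tab, then its comma-separated words
+    BufferedReader reader = new BufferedReader(new FileReader("output_preprocess_sample"));
+    String line;
+    while ((line = reader.readLine()) != null) {
+        String[] parts = line.split("\t", 2);
+        if (parts.length == 2) {
+            document.put(new Text(parts[0]), new Text(parts[1]));
+        }
+    }
+    reader.close();
+}
+```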