To avoid redundant parses of the input file, some intuition is needed: in this algorithm, the input file is only parsed once.
* Load in advance all document ids (in our case, the total line number n obtained from the line-number counter during preprocessing);
* We use [output_preprocess_sample](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess_sample) as input and, for each document from the input, emit `(id,i)`, where `id` is the document's id and `i` ranges over `1:n` with `i!=id`. Note that some documents may be empty, and I will treat them during the Reduce phase.
```
public void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
    // ...
}
```
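The body of this mapper is elided above. A minimal sketch of what it could look like, assuming `n` is passed through the job configuration (the key `total.line.number` is hypothetical; the report actually obtains n from a preprocessing counter) and that `TextPair` orders its two components so that `(a,b)` and `(b,a)` group at the same reducer:

```
// Inside the Naive Approach mapper class (a sketch; names are assumptions).
private long n; // total number of documents, read in setup()

@Override
protected void setup(Context context) {
    // Hypothetical configuration key carrying the preprocessing line count.
    n = context.getConfiguration().getLong("total.line.number", 0);
}

@Override
public void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
    long id = Long.parseLong(key.toString());
    // Pair this document with every other document id in 1..n.
    for (long i = 1; i <= n; i++) {
        if (i != id) {
            // Assumes TextPair normalizes the order of its components.
            context.write(new TextPair(key.toString(), String.valueOf(i)), value);
        }
    }
}
```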
**STEP 3: Reduce phase**
In the reduce phase,
* Firstly, process only keys that involve no empty lines.
* Secondly, compare the similarity of the two documents for each key, counting the comparisons with a counter.
```
public void reduce(TextPair key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // ...
}
```
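The reduce body is elided as well. A sketch of it, assuming the two values under a key are the contents of the paired documents, and that `CompCounter.NUM` is an assumed counter enum mirroring the `CompCounter2.NUM` used in the Index Approach below:

```
// Inside the Naive Approach reducer class (a sketch; names are assumptions).
@Override
public void reduce(TextPair key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    List<String> docs = new ArrayList<String>();
    for (Text v : values) {
        docs.add(v.toString());
    }
    // A key involving an empty document receives fewer than two usable values: skip it.
    if (docs.size() < 2 || docs.get(0).isEmpty() || docs.get(1).isEmpty()) {
        return;
    }
    context.getCounter(CompCounter.NUM).increment(1);
    double s = similarity(docs.get(0), docs.get(1));
    if (s >= 0.8) {
        context.write(new Text(key.toString()), new Text(String.valueOf(s)));
    }
}
```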
[The execution time](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/hadoop_naive.PNG) is `4 min 15 s`.
The [comparison count](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counter_Naive.PNG) is 11476. Knowing that we have [n=152 documents](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/LineNum_sample.PNG) in our sampled file, we find 11476 = 152*(152-1)/2, so the algorithm worked as expected.
You can find the Hadoop overview below:
![](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/hadoop_naive.PNG)
See the complete code [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/NaiveApproach.java). I didn't commit the output since it's empty for the sample.
### Index Approach
In this method, I need to create an inverted index, but only for the first |d| - ⌈t|d|⌉ + 1 words of each document d. The reducer then computes the similarity of the candidate document pairs.
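For example, with t = 0.8 and a document of |d| = 10 words, only the first 10 - ⌈0.8 × 10⌉ + 1 = 3 words are indexed: given a fixed global word order, two documents with similarity ≥ 0.8 must share at least one word among their prefixes, so every qualifying pair still meets under at least one key.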
**STEP 1: Map phase**
* Compute the prefix filtering length `int filter`;
* Emit each prefix filtering word as `key` and the document id as `value`.
```
private Text word = new Text(); // reused output key (assumed field declaration)

public void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
    // Skip empty documents.
    if (value.toString().isEmpty()) {
        return;
    }
    String[] document = value.toString().split(",");
    // Prefix length |d| - ceil(t*|d|) + 1 with t = 0.8.
    int filter = document.length - (int) Math.ceil(document.length * 0.8) + 1;
    int counter = 0;
    while (counter < filter) {
        // Emit (prefix word, document id).
        word.set(document[counter]);
        context.write(word, key);
        counter += 1;
    }
}
```
**STEP 2: Reduce phase**
* We load the adapted pre-processed file into a HashMap object (a sketch of this loading and of the `similarity function` follows the code below);
* For each key, we get all possible combinations of document pairs in the document list;
* We compute the similarity of the two documents in a pair, if there is a pair, reusing the same `similarity function` as before, and count the comparison times with a counter.
```
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    // Copy the values first: Hadoop reuses the Text object it hands out.
    List<Text> val = new ArrayList<Text>();
    for (Text v : values) {
        val.add(new Text(v));
    }
    // Compare every pair of documents sharing this prefix word.
    for (int i = 0; i < val.size(); i++) {
        Text v1 = val.get(i);
        for (int j = i + 1; j < val.size(); j++) {
            Text v2 = val.get(j);
            if (v1.equals(v2)) {
                continue; // same document listed twice
            }
            // this.document maps a document id to its content (filled in setup()).
            String s1 = this.document.get(v1).toString();
            String s2 = this.document.get(v2).toString();
            context.getCounter(CompCounter2.NUM).increment(1);
            double s = similarity(s1, s2);
            if (s >= 0.8) {
                context.write(new Text(v1.toString() + ',' + v2.toString()),
                        new Text(String.valueOf(s)));
            }
        }
    }
}
```
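Neither the `setup()` that fills `this.document` nor the `similarity` function appears in this excerpt. A minimal sketch of both, assuming the pre-processed file holds one document per line as `id<TAB>comma-separated words` (the configuration key `preprocessed.file` is hypothetical) and that similarity is word-level Jaccard:

```
// Assumed field: document id -> content, loaded once per reducer.
private Map<Text, Text> document = new HashMap<Text, Text>();

@Override
protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    // Hypothetical configuration key pointing at the adapted pre-processed file.
    Path path = new Path(conf.get("preprocessed.file"));
    FileSystem fs = FileSystem.get(conf);
    BufferedReader reader =
            new BufferedReader(new InputStreamReader(fs.open(path)));
    String line;
    while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        document.put(new Text(parts[0]),
                new Text(parts.length > 1 ? parts[1] : ""));
    }
    reader.close();
}

// Word-level Jaccard similarity: |A ∩ B| / |A ∪ B| over comma-separated words.
public static double similarity(String s1, String s2) {
    Set<String> a = new HashSet<String>(Arrays.asList(s1.split(",")));
    Set<String> b = new HashSet<String>(Arrays.asList(s2.split(",")));
    Set<String> inter = new HashSet<String>(a);
    inter.retainAll(b);
    Set<String> union = new HashSet<String>(a);
    union.addAll(b);
    return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
}
```

Holding the whole file in each reducer is fine for the 152-document sample, though a distributed cache would be the usual choice at larger scale.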
[The execution time](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Hadoop_IndexApproach.PNG) is `42 seconds`, much less than the Naive Approach.
The [comparison count](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counter_IndexApproach.PNG) is 17, far fewer than the Naive Approach's 11476.
You can find the Hadoop overview below:
![](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Hadoop_IndexApproach.PNG)
See the complete code [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/IndexApproach.java). I didn't commit the output since it's empty for the sample.