Commit 13369f81 authored by Wen Yao Jin

Update BDPA_Assign2_WJIN.md

parent ff15a9c4
@@ -219,9 +219,97 @@ To compute the similarity of two strings as instructed, we used a `Set` to store
        // Cast to double: both sizes are ints, so plain division would truncate to 0 or 1.
        return (double) intersection.size() / union.size();
    }
```
##### _File sampling_
Our input file has more than `100000` documents. With pairwise comparison in the naive approach, more than `10^10` pairs would be emitted, which is far beyond the capacity of our virtual machine. In the following sections, we therefore test the algorithms only on sampled documents. The sample files can be found [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/sortedline_sample) and [here](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/linenumber_sample).
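
As a rough check of that estimate, the sketch below counts the records the naive mapper would emit and the comparisons the reducers would perform for `n` documents. The class and method names are illustrative only and are not part of the assignment code.

```java
// Back-of-the-envelope sizing for the naive all-pairs approach.
// PairCount and its methods are hypothetical names used for illustration.
final class PairCount {
    // Every document emits one record per other document: n * (n - 1).
    static long emittedRecords(long n) {
        return n * (n - 1);
    }

    // Each unordered pair {i, j} is compared once: n * (n - 1) / 2.
    static long comparisons(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        long n = 100_000L; // lower bound on the corpus size quoted above
        System.out.println("emitted records: " + emittedRecords(n));
        System.out.println("comparisons:     " + comparisons(n));
    }
}
```

With exactly `100000` documents this gives just under `10^10` emitted records, and any larger corpus pushes the count past `10^10`.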
#### 1. Naive approach
> Perform all pairwise comparisons between documents, using the following technique: Each document is handled by a single mapper (remember that lines are used to represent documents in this assignment). The map method should emit, for each document, the document id along with one other document id as a key (one such pair for each other document in the corpus) and the document's content as a value. In the reduce phase, perform the Jaccard computations for all/some selected pairs. Output only similar pairs on HDFS, in TextOutputFormat.

Avoiding redundant parses of the input file takes some care. In this algorithm, the input file is parsed only once:
* Store all document ids in advance. In our case we use the total line count obtained from the earlier counter, so the ids of empty documents are included; they are filtered out in the later steps.
* For each id and document in the input, emit `(id,i), document` for every `i!=id` in the document id set (which also includes the ids of empty lines). Because the key pair is symmetric, most keys will have two instances.
* In the reduce phase, process only keys with exactly two instances. This filters out empty documents: they are not in the input file, so any key involving one appears only once. Since empty documents are rare, the extra keys have little effect on computation time.
* The two instances are exactly the two documents to compare for that key. Compute their similarity and emit the key pair if the documents are similar.
```java
     @Override
     public void map(Text key, Text value, Context context)
             throws IOException, InterruptedException {
         // Skip empty documents.
         if (value.toString().isEmpty()) {
             return;
         }
         // The key must be a numeric document id (the line number).
         String keyOut = key.toString();
         if (!StringUtils.isNumeric(keyOut)) {
             System.out.println("WARN: Bad input id");
             System.out.println(keyOut);
             return;
         }
         this.words = value;
         this.keyPair.setFirst(Long.parseLong(keyOut));
         // Emit (id, i) -> document for every other line number i in 1..fileLength.
         for (long counter = 1; counter <= this.fileLength; counter++) {
             this.keyPair.setSecond(counter);
             if (this.keyPair.getDiff() == 0) {
                 continue; // never pair a document with itself
             }
             context.write(keyPair, words);
         }
     }
```
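
The mapper relies on the key pair being symmetric, so that `(3,7)` and `(7,3)` reach the same reduce call. The `LongPair` class itself is not shown in this section; the sketch below is a hypothetical stand-in, written without Hadoop dependencies, that illustrates one way to obtain this behaviour by normalising the order of the two ids. A real Hadoop key would additionally implement `WritableComparable` (`write` and `readFields`).

```java
import java.util.Objects;

// Hypothetical stand-in for the LongPair key; not the author's implementation.
final class SymmetricLongPair implements Comparable<SymmetricLongPair> {
    private long first;
    private long second;

    static SymmetricLongPair of(long first, long second) {
        SymmetricLongPair p = new SymmetricLongPair();
        p.setFirst(first);
        p.setSecond(second);
        return p;
    }

    void setFirst(long v) { first = v; }
    void setSecond(long v) { second = v; }

    // Zero exactly when both ids are the same document.
    long getDiff() { return Math.abs(first - second); }

    private long low() { return Math.min(first, second); }
    private long high() { return Math.max(first, second); }

    // Equality, hashing and ordering all ignore the order of the two ids.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SymmetricLongPair)) {
            return false;
        }
        SymmetricLongPair other = (SymmetricLongPair) o;
        return low() == other.low() && high() == other.high();
    }

    @Override
    public int hashCode() { return Objects.hash(low(), high()); }

    @Override
    public int compareTo(SymmetricLongPair other) {
        int c = Long.compare(low(), other.low());
        return c != 0 ? c : Long.compare(high(), other.high());
    }

    @Override
    public String toString() { return low() + "," + high(); }
}
```

Because the ordering and hash depend only on the smaller and larger id, both emissions of a pair would be grouped under one key, which is what lets the reducer see two values per comparable pair.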
```java
     @Override
     public void reduce(LongPair key, Iterable<Text> values, Context context)
             throws IOException, InterruptedException {
         int counter = 0;
         String[] strings = new String[2];
         for (Text v : values) {
             // Keep at most two values so an unexpected duplicate id cannot overflow the array.
             if (counter < 2) {
                 strings[counter] = v.toString();
             }
             counter += 1;
         }
         if (counter != 2) { // the other document id is not in the input file
             return;
         }
         double s = similarity(strings[0], strings[1]);
         // Count every comparison actually performed.
         context.getCounter(CompCounter.NUM).increment(1);
         if (s >= 0.8) {
             context.write(key, new DoubleWritable(s));
         }
     }
```
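
To see the whole flow on a small input, the map and reduce steps above can be simulated in plain Java without a cluster. This is only a sketch under stated assumptions: the class name, the string-based `pairKey` helper (standing in for the symmetric key pair) and the whitespace tokenisation in `jaccard` are illustrative choices, not the assignment's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// In-memory simulation of the naive approach; all names here are hypothetical.
final class NaivePairsSketch {
    // Jaccard similarity over whitespace-separated words (tokenisation is an assumption).
    static double jaccard(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.trim().split("\\s+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.trim().split("\\s+")));
        Set<String> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // Order-independent key, standing in for the symmetric key pair.
    static String pairKey(long a, long b) {
        return Math.min(a, b) + "," + Math.max(a, b);
    }

    static Map<String, Double> similarPairs(Map<Long, String> docs, long totalLines, double threshold) {
        // "Map" phase: each non-empty document is emitted once per other line number.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<Long, String> e : docs.entrySet()) {
            for (long other = 1; other <= totalLines; other++) {
                if (other == e.getKey()) {
                    continue;
                }
                grouped.computeIfAbsent(pairKey(e.getKey(), other), k -> new ArrayList<>())
                       .add(e.getValue());
            }
        }
        // "Reduce" phase: only keys with exactly two documents are compared.
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> g : grouped.entrySet()) {
            List<String> contents = g.getValue();
            if (contents.size() != 2) {
                continue; // the other id belongs to an empty line
            }
            double s = jaccard(contents.get(0), contents.get(1));
            if (s >= threshold) {
                result.put(g.getKey(), s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, String> docs = new TreeMap<>();
        docs.put(1L, "a b c d e");
        docs.put(2L, "a b c d e");
        docs.put(4L, "x y z"); // line 3 is empty, so it never appears in the input
        System.out.println(similarPairs(docs, 4, 0.8));
    }
}
```

In this toy input, every key involving line 3 receives a single document and is skipped, which mirrors how the reducer ignores empty documents.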
The Hadoop job overview:
![](https://gitlab.my.ecp.fr/2014jinwy/BDPA_Assign2_WJIN/blob/master/img/naiveapproach.png)