In this mapreduce program, keys emitted from mappers will be pairs of keys. Thus ...
Several remarks and some intuition:
* `LongPair` needs to implement the `WritableComparable` interface in order to permit shuffling and ordering
* Override `equals`: the order within a pair should not be taken into account when checking whether two pairs are equal (for example, `(A,B)` should equal `(B,A)`). So our function must also compare against the inverted pair before returning false.
* Override `compareTo`: the comparison itself does not matter much, but it must stay coherent with `equals`. Here we compare pairs by computing a `sum` value (sum of the two ids) and a `difference` value (absolute difference of the two ids). One can check that two pairs are equal if and only if both of these values coincide.
```java
class LongPair implements WritableComparable<LongPair> {
    ...
}
```
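The two overrides described above can be sketched as follows. This is only an illustration: the field names `first`/`second` are assumptions, and the real class also implements Hadoop's `write`/`readFields` methods, omitted here so the sketch stays dependency-free (it implements plain `Comparable` instead of `WritableComparable`).

```java
// Sketch of an order-insensitive pair key. Field names are hypothetical;
// the real LongPair also implements Hadoop's Writable serialization.
class LongPair implements Comparable<LongPair> {
    long first, second;

    LongPair(long first, long second) { this.first = first; this.second = second; }

    // (A,B) must equal (B,A): check both orderings before returning false.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof LongPair)) return false;
        LongPair p = (LongPair) o;
        return (first == p.first && second == p.second)
            || (first == p.second && second == p.first);
    }

    // Must be order-insensitive to stay coherent with equals.
    @Override
    public int hashCode() {
        return Long.hashCode(first + second) * 31
             + Long.hashCode(Math.abs(first - second));
    }

    // Compare by (sum, absolute difference): these two values coincide
    // exactly when the pairs are equal, so compareTo == 0 iff equals.
    @Override
    public int compareTo(LongPair p) {
        int bySum = Long.compare(first + second, p.first + p.second);
        if (bySum != 0) return bySum;
        return Long.compare(Math.abs(first - second), Math.abs(p.first - p.second));
    }
}
```

Note that `hashCode` must also be symmetric, otherwise equal pairs could land in different hash buckets.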
##### _Similarity_
To compute the similarity of two strings as instructed, we use a `Set` to store words. A set automatically ignores duplicates, which enables quick calculation of the union and intersection.
```java
public double similarity(String t1, String t2) {
    ...
}
```
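A minimal sketch of such a set-based similarity, assuming Jaccard similarity over whitespace-separated words (the elided method above may differ in tokenization details):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Jaccard {
    // Jaccard similarity: |intersection| / |union| of the two word sets.
    public static double similarity(String t1, String t2) {
        Set<String> s1 = new HashSet<>(Arrays.asList(t1.split("\\s+")));
        Set<String> s2 = new HashSet<>(Arrays.asList(t2.split("\\s+")));
        Set<String> union = new HashSet<>(s1);
        union.addAll(s2);
        Set<String> inter = new HashSet<>(s1);
        inter.retainAll(s2);   // duplicates are already collapsed by the sets
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }
}
```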
##### _File sampling_
Our input file has more than `100000` documents. With pairwise calculation in the naive approach, more than `10^10` pairs would be emitted, far beyond the capacity of our virtual machine. In the following sections we therefore only tested the algorithms on sampled documents (the first 1000 documents). The files can be found [here](sortedline_sample) and [here](linenumber_sample).
#### 1. Naive approach
Instruction:
>Perform all pairwise comparisons between documents, using the following technique:
Each document is handled by a single mapper (remember that lines are used to represent documents in this assignment).
The map method should emit, for
...
Output only similar pairs on HDFS, in TextOutputFormat.
To avoid redundant parsing of the input file, some care is needed: in this algorithm, the input file is parsed only once.
* Load in advance all document ids (in our case we use the total line number `n` obtained from the previous counter, so the id set is the numbers in `1:n`; ids of empty documents are ignored later in the algorithm)
* For each id and document from the input, emit `(id,i), document` for every `i != id` in the document id set (including the ids of empty lines). Due to the symmetry of the key pair, most keys will have two instances.
* In the reduce phase, process only keys with `two` instances. This ignores empty documents: since they do not appear in the input file, their keys appear only once. Empty documents are rare, so the computation time is not much affected.
* The two instances are exactly the two documents that we need to compare for that key. Compute the similarity and emit the key pairs that are similar.
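The emit-and-group logic above can be simulated in plain Java (this is not the Hadoop job itself, just a sketch assuming ids `1..n` and a map from id to document text, with empty documents simply absent):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaivePairing {
    // Simulate the map phase: each (id, doc) emits ((min,max), doc) for every
    // other id in 1..n, so each pair key collects one copy from each side.
    public static Map<List<Long>, List<String>> group(Map<Long, String> docs, long n) {
        Map<List<Long>, List<String>> groups = new HashMap<>();
        for (Map.Entry<Long, String> e : docs.entrySet()) {
            for (long i = 1; i <= n; i++) {
                if (i == e.getKey()) continue;
                // normalize the key so (id,i) and (i,id) land in the same group
                List<Long> key = Arrays.asList(Math.min(i, e.getKey()),
                                               Math.max(i, e.getKey()));
                groups.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return groups;
    }

    // Simulate the reduce phase: keep only keys with exactly two instances;
    // keys touching an empty (absent) document appear once and are dropped.
    public static Map<List<Long>, List<String>> reduceOnlyPairs(Map<List<Long>, List<String>> groups) {
        Map<List<Long>, List<String>> out = new HashMap<>();
        groups.forEach((k, v) -> { if (v.size() == 2) out.put(k, v); });
        return out;
    }
}
```

With `n = 3` and only documents 1 and 2 present, only the key `(1,2)` survives the reduce filter, exactly as described above.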
...
The hadoop job overview:
`365085` similarities were calculated. Knowing that we have `n=855` documents in our sampled file, we find `365085 = n*(n-1)/2`. So the algorithm worked as expected.
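The count is easy to verify directly:

```java
public class PairCount {
    public static void main(String[] args) {
        long n = 855;                  // documents in the sampled file
        long pairs = n * (n - 1) / 2;  // number of unordered pairs
        System.out.println(pairs);     // 365085
    }
}
```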
#### 2. Pre-filtering approach
Instruction:
>Create an inverted index, only for the first |d| - ⌈t|d|⌉ + 1 words of each document d
...
comparisons.
In this part, the implementation is more straightforward:
* In the map phase, invert ids and words: output separately each word in the window `|d| - ⌈t|d|⌉ + 1` as key, with the document id as value
* Since the map phase outputs only ids and not the document corpus, a hashmap for document retrieval is needed at the reduce phase. We load it in the `setup` function.
* In the reduce phase, for each key, compute similarities if several document ids are present. Since the words are sorted by frequency, ideally far fewer comparisons are needed
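The window size used in the first bullet can be sketched as follows (assuming `t` is the similarity threshold and `docLen` stands for `|d|`, the document's word count):

```java
public class Prefix {
    // Number of leading words to index: |d| - ceil(t*|d|) + 1.
    // Intuition: two documents with similarity >= t must share at least
    // one word within this prefix of their (frequency-sorted) word lists.
    public static int windowSize(int docLen, double t) {
        return docLen - (int) Math.ceil(t * docLen) + 1;
    }
}
```

For example, with `t = 0.8` a 10-word document only contributes its first 3 words to the inverted index.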
**Mapper:**
```java
...
```
The hadoop job overview:
**Execution time:** `15 s`
`976` comparisons are made in this job, far fewer than in the naive approach.
#### 3. Justification of the difference
The output of similar documents can be found [here](similardoc). Remember that we used a sampled file, so there are far fewer similar docs than there would be on the full input. Still, we can see that similar docs are very rare even relative to the length of the sampled file.