diff --git a/BDPA_Assign2_WJIN.md b/BDPA_Assign2_WJIN.md
index 744a466f81a0872b420c820d7a86ab07c64c9608..bd9c9c4c9238b2807f10762efea95120265309f8 100644
--- a/BDPA_Assign2_WJIN.md
+++ b/BDPA_Assign2_WJIN.md
@@ -85,8 +85,8 @@ In this mapreduce program, keys emitted from mappers will be pairs of keys. Thus
 Several remarks and intuition here:
 * LongPair need to implement `WritableComparable` interface in order to permit shuffle and order
-* Override function `equals` : We should see that order within the pairs should not be taken into account when checking two pairs are equal or not (For example : `(A,B)` should equal `(B,A)`). So our function should inverse one pair to verify inequality before yield no.
-* Override function `compareTo` : The compare function has not much importance, but its difficulty lies in the necessity of coherence with `equals`. Here I proposed the method of comparing pairs by calculating a `sum` value(sum of two id) and a `difference` value(absolute difference of two id). We can check that 2 pairs can be equal if and only if pairwise difference of this two values are both zero.
+* Override function `equals` : Order within a pair must not be taken into account when checking whether two pairs are equal (for example, `(A,B)` should equal `(B,A)`). So our function must also compare against the inverted pair before declaring two pairs unequal.
+* Override function `compareTo` : The exact ordering matters little, but it must stay coherent with `equals`. Here we propose comparing pairs by a `sum` value (sum of the two ids) and a `difference` value (absolute difference of the two ids). One can check that two pairs are equal if and only if both of these values coincide.
 ```java
 class LongPair implements WritableComparable<LongPair> {
@@ -199,7 +199,7 @@ class LongPair implements WritableComparable<LongPair> {
 ```

 ##### _Similarity_
-To compute the similarity of two strings as instructed, we used a `Set` to store words. The advantage of set is its automatical ignorance of duplicates which enable quick calculation of union and intersection operations.
+To compute the similarity of two strings as instructed, we used a `Set` to store words. A set automatically ignores duplicates, which enables quick calculation of the union and intersection operations.

 ```java
 public double similarity(String t1, String t2) {
@@ -220,14 +220,14 @@ To compute the similarity of two strings as instructed, we used a `Set` to store
 }
 ```

 ##### _File sampling_
-Our input file has more than `100000` documents. Consider pairwise calculation in the naive approah, there will be more than `10^10` pairs emitted. This exceeds way beyond the capacity of our virtual machine. In the following section, we only tested the algorithms on sampled documents. The files can be found [here](sortedline_sample) and [here](linenumber_sample).
+Our input file has more than `100000` documents. With pairwise calculation in the naive approach, more than `10^10` pairs would be emitted, far beyond the capacity of our virtual machine. In the following section, we therefore only tested the algorithms on sampled documents (the first 1000 documents). The files can be found [here](sortedline_sample) and [here](linenumber_sample).

 #### 1. Naive approach
+Instruction:
 >Perform all pair wise comparisons bet
-ween documents, using the following
-technique
+ween documents, using the following technique
 : Each document is handled by a single
 mapper (remember that lines are used to
 represent documents in this assignment).
 The map method should emit, for
@@ -244,7 +244,7 @@ Output only similar pairs on HDFS, in TextOutputFormat.
 To avoid redundant parses of the input file, some intuition is needed. In this algorithm, the input file is only parsed once.
-* Stock in advance all document id (in our case we use the total line number that we got from the previous counter, id of empty documents will be ignore in the following)
+* Load in advance all document ids (in our case we use the total line number `n` obtained from the previous counter, so the id set is simply the numbers `1:n`; ids of empty documents are ignored later in the algorithm)
 * For each id and document from the input, emit `(id,i), document` for all `i!=id` in the document id set (include also id of empty line). Due to the symmetry of the key pair, most keys will have two instances.
 * In the reduce phrase, process only keys with `two` instances. In this way we ignore empty documents because empty documents are not in input file, so they only appear once. Since empty documents are not often, computational time will not be too much affected.
 * The two instances are exactly the two documents that we need to compare for each key. Calculate similarity and emit key pairs that are similar.
@@ -322,6 +322,7 @@ The hadoop job overview:
 `365085` similarity are calculated. Knowing that we have `n=855` documents in our sampled file, we find `365085=n*(n-1)/2`. So, the algorithm worked as expected.

 #### 2. Pre-filtering approach
+Instruction:
 >Create an inverted index, only for the
 first |d| -⌈t|d|⌉+ 1 words of each
 document d
@@ -339,9 +340,9 @@ comparisons .

 In this part, the implementation is more trivial:

-* At the map phrase, inverse id and words, output all words that are in the window `|d| -⌈t|d|⌉+ 1`
+* At the map phase, invert ids and words: output separately each word in the window `|d| -⌈t|d|⌉+ 1` as key, with the document id as value
 * Since in the map phrase we didn't output the document corpus but only ids, a hashmap for document retrieval is needed at the reduce phase. We load it in `setup` function.
-* At reduce phase, for every key, compute similarity if severals document id are represented. Since the words are sorted by frequency, ideally there will be much less comparaison needed
+* At the reduce phase, for each key, compute the similarity whenever several document ids are represented. Since the words are sorted by frequency, ideally far fewer comparisons will be needed

 **Mapper:**
 ```java
@@ -396,11 +397,11 @@ The hadoop job overview:

 **Execution time:** ``15 s``

-`976` comparaison is made in this job, much less than the naive approach.
+Only `976` comparisons are made in this job, far fewer than in the naive approach.

 #### 3 Justification of difference

-The output similar documents can be find [here](similardoc). Remember that we used a sampled file, so there are way less similar docs than it supposed to be. However we can still see that, similar doc is very rare even compared to the sampled file length.
+The output of similar documents can be found [here](similardoc). Remember that we used a sampled file, so there are far fewer similar documents than there would be on the full input. Even so, similar documents are very rare compared to the sampled input file length.

 | Job | # of comparaison | Execution Time |
 |:----------------:|:----------------:|:--------------:|
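
As a side note on the first hunk of this patch: the `equals`/`compareTo` coherence it describes can be illustrated with a small standalone sketch. This is only an illustration, not the report's actual Hadoop class — the `Writable` serialization methods are omitted, `Comparable` stands in for `WritableComparable`, and the field names `first`/`second` are assumptions; only the order-insensitive `equals` and the (sum, absolute difference) ordering described in the text are shown.

```java
// Sketch of the order-insensitive pair described in the patch.
// Assumptions: fields named first/second; the real class also implements
// Hadoop's Writable read/write methods, omitted here for brevity.
public class LongPair implements Comparable<LongPair> {
    private final long first;
    private final long second;

    public LongPair(long first, long second) {
        this.first = first;
        this.second = second;
    }

    // (A,B) must equal (B,A): also check against the inverted pair.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof LongPair)) return false;
        LongPair p = (LongPair) o;
        return (first == p.first && second == p.second)
            || (first == p.second && second == p.first);
    }

    // hashCode must also be order-insensitive to stay consistent with equals.
    @Override
    public int hashCode() {
        return Long.hashCode(first + second) * 31
             + Long.hashCode(Math.abs(first - second));
    }

    // Order by (sum, absolute difference): compareTo returns 0
    // exactly when both values match, i.e. exactly when equals is true.
    @Override
    public int compareTo(LongPair p) {
        int bySum = Long.compare(first + second, p.first + p.second);
        if (bySum != 0) return bySum;
        return Long.compare(Math.abs(first - second), Math.abs(p.first - p.second));
    }

    public static void main(String[] args) {
        LongPair ab = new LongPair(3, 7);
        LongPair ba = new LongPair(7, 3);
        System.out.println(ab.equals(ba));                         // prints true
        System.out.println(ab.compareTo(ba));                      // prints 0
        System.out.println(ab.compareTo(new LongPair(1, 9)) < 0); // same sum 10, smaller diff: prints true
    }
}
```

The consistency claim holds because `a+b` and `|a-b|` determine the unordered pair `{a,b}` (its elements are `((a+b) ± |a-b|) / 2`), so `compareTo` returns `0` precisely when `equals` returns `true`, as the `Comparable` contract recommends.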