Commit 2e632454 authored by Wen Yao Jin

Update BDPA_Assign2_WJIN.md

parent 8048e6bd
@@ -400,10 +400,13 @@ The hadoop job overview:
#### 3 Justification of difference
The output of similar documents can be found [here](similardoc). Remember that we used a sampled file, so there are far fewer similar documents than there would be on the full input. Even so, similar documents are very rare relative to the size of the sampled file.
| Job | # of comparisons | Execution Time |
|:----------------:|:----------------:|:--------------:|
| NaiveApproach | 365085 | 7m 50s |
| PrefilteringApproach | 976 | 15s |
The naive approach compares every pair of documents, so its computation and the amount of data it shuffles grow as O(n^2) in the number of documents n. It therefore needs much more time, including in the shuffle and sort phase.
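The comparison count reported for the naive job is consistent with comparing every unordered pair of documents. A quick check, assuming the sample held 855 documents (a figure inferred from the count above, not stated in the report):

```python
from math import comb


def naive_comparisons(n_docs: int) -> int:
    """Number of unordered document pairs an all-pairs comparison examines."""
    return comb(n_docs, 2)  # n * (n - 1) / 2


# 855 is an inferred sample size, not a number given in the report.
print(naive_comparisons(855))  # 365085
```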
The prefiltering approach is very efficient when similar documents are rare and documents are not very long, which is exactly our case. This explains the drastic performance difference.
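To illustrate why prefiltering cuts the number of comparisons so sharply, here is a minimal single-machine sketch. It is not the Hadoop job from this report. It assumes Jaccard similarity over word sets, non-empty documents, and a standard prefix filter: each document's words are sorted from globally rarest to most frequent, and only the first `|d| - ceil(t * |d|) + 1` words are indexed. The function names, the toy documents, and the threshold 0.8 are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two non-empty word sets."""
    return len(a & b) / len(a | b)


def naive_pairs(docs, threshold):
    """Compare every unordered pair; return (similar pairs, comparisons made)."""
    compared, similar = 0, []
    for (i, a), (j, b) in combinations(docs.items(), 2):
        compared += 1
        if jaccard(a, b) >= threshold:
            similar.append((i, j))
    return similar, compared


def prefiltered_pairs(docs, threshold):
    """Compare only pairs sharing a word in their prefixes (prefix filtering)."""
    # Global word frequencies define one fixed ordering, rarest words first.
    freq = Counter(w for words in docs.values() for w in words)
    index = defaultdict(list)  # prefix word -> ids of documents seen so far
    candidates = set()
    for doc_id, words in docs.items():
        ordered = sorted(words, key=lambda w: (freq[w], w))
        prefix_len = len(ordered) - ceil(threshold * len(ordered)) + 1
        for w in ordered[:prefix_len]:
            for other in index[w]:
                candidates.add((other, doc_id))
            index[w].append(doc_id)
    similar = [(i, j) for i, j in sorted(candidates)
               if jaccard(docs[i], docs[j]) >= threshold]
    return similar, len(candidates)


# Toy input, made up for illustration only.
docs = {
    "d1": {"the", "quick", "brown", "fox"},
    "d2": {"the", "quick", "brown", "fox", "jumps"},
    "d3": {"a", "lazy", "dog"},
    "d4": {"hadoop", "map", "reduce"},
}
print(naive_pairs(docs, 0.8))        # ([('d1', 'd2')], 6)
print(prefiltered_pairs(docs, 0.8))  # ([('d1', 'd2')], 1)
```

Both functions find the same similar pair, but the prefiltered version verifies one candidate instead of six. When documents rarely share rare words, most pairs are never generated at all.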