From 2e632454e014c7055d69eaef2d322e9e02b820dc Mon Sep 17 00:00:00 2001
From: Wen Yao Jin <wen-yao.jin@student.ecp.fr>
Date: Sat, 11 Mar 2017 21:01:05 +0100
Subject: [PATCH] Update BDPA_Assign2_WJIN.md

---
 BDPA_Assign2_WJIN.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/BDPA_Assign2_WJIN.md b/BDPA_Assign2_WJIN.md
index c6708ab..744a466 100644
--- a/BDPA_Assign2_WJIN.md
+++ b/BDPA_Assign2_WJIN.md
@@ -400,10 +400,13 @@ The hadoop job overview:
 
 #### 3 Justification of difference
 
+The list of similar documents found can be viewed [here](similardoc). Remember that we used a sampled file, so there are far fewer similar documents than there would be on the full input. Even so, similar documents are very rare relative to the size of the sampled file.
+
 | Job | # of comparaison | Execution Time |
 |:----------------:|:----------------:|:--------------:|
 | NaiveApproach | 365085 | 7m 50s |
 | PrefilteringApproach | 976 | 15s |
 
-The naive approach takes O(n) computational time and memory, thus needs much more time, even in the shuffle and sort phase.
+The naive approach compares every pair of documents, so it takes O(n^2) time and memory for n documents and thus needs much more time, even in the shuffle and sort phase.
+
 The prefiltering approach is very efficient when similar documents are rare and documents are not very long, which is exactly our case. This explains the drastic performance difference.
-- 
GitLab
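To make the gap between the two comparison counts in the table concrete, here is a minimal Python sketch that contrasts all-pairs comparison with a prefix-filtering candidate search. It is not the Hadoop code from the assignment. The function names, the toy documents, the 0.5 threshold, and the use of Jaccard similarity are all assumptions made for illustration. The sketch also assumes each document is a list of unique tokens sorted in one global order. One observation that can be checked directly: 365085 equals 855 * 854 / 2, which is consistent with comparing every pair among 855 documents (an inference from the table, not a figure stated in the report).

```python
import math
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of unordered pairs among n documents: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def jaccard(a, b):
    """Jaccard similarity of two token lists (assumed similarity measure)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def naive_pairs(docs, threshold):
    """Compare every pair of documents; return (similar pairs, comparisons)."""
    comparisons = 0
    similar = []
    for (i, a), (j, b) in combinations(docs.items(), 2):
        comparisons += 1
        if jaccard(a, b) >= threshold:
            similar.append((i, j))
    return similar, comparisons


def prefix_pairs(docs, threshold):
    """Prefix filtering: index only the first |d| - ceil(t * |d|) + 1 tokens.

    Two documents are compared only if their prefixes share a token.
    Assumes every token list is duplicate-free and sorted in the same
    global order; without that, similar pairs could be missed.
    """
    index = defaultdict(list)  # token -> ids of documents seen so far
    candidates = set()
    for doc_id, tokens in docs.items():
        prefix_len = len(tokens) - math.ceil(threshold * len(tokens)) + 1
        for token in tokens[:prefix_len]:
            for other_id in index[token]:
                candidates.add((other_id, doc_id))
            index[token].append(doc_id)
    similar = [
        (i, j) for i, j in sorted(candidates)
        if jaccard(docs[i], docs[j]) >= threshold
    ]
    return similar, len(candidates)


# Hypothetical toy corpus: tokens are unique and alphabetically sorted.
DOCS = {
    "d1": ["a", "b", "c", "d"],
    "d2": ["a", "b", "c", "e"],
    "d3": ["w", "x", "y", "z"],
    "d4": ["m", "n", "o", "p"],
}

if __name__ == "__main__":
    print(naive_pairs(DOCS, 0.5))
    print(prefix_pairs(DOCS, 0.5))
    print(all_pairs_count(855))
```

On this toy corpus both functions return the same similar pair, but the naive version performs one comparison per pair of documents while the prefix-filtered version compares only documents whose prefixes share a token. The fewer similar documents there are and the shorter the prefixes, the fewer candidates the index produces.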