The similar documents we found are listed [here](similardoc). Remember that we ran on a sampled file, so there are far fewer similar documents than there would be on the full input. Even so, we can see that similar documents are very rare, even relative to the length of the sampled file.
The naive approach compares every pair of documents, so for n documents it takes O(n^2) computational time and memory. It therefore needs much more time, including in the shuffle and sort phase.
The prefiltering approach is very efficient when similar documents are rare and documents are not very long, which is exactly our case. This explains the drastic performance difference.
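To make the difference concrete, here is a minimal single-machine sketch in Python, not the MapReduce job itself. It assumes that documents are sets of words, that similarity is Jaccard similarity, and that prefiltering means indexing only the rarest words of each document (prefix filtering); the function names, the sample `docs`, and the 0.5 threshold are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two word sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def naive_pairs(docs, threshold):
    """Compare every pair of documents: n * (n - 1) / 2 comparisons."""
    similar, comparisons = set(), 0
    for i, j in combinations(sorted(docs), 2):
        comparisons += 1
        if jaccard(docs[i], docs[j]) >= threshold:
            similar.add((i, j))
    return similar, comparisons


def prefiltered_pairs(docs, threshold):
    """Only compare documents that share a word in their rare-word prefix."""
    # Global word frequencies define one ordering, rarest words first.
    freq = Counter(w for words in docs.values() for w in words)
    index = defaultdict(list)  # word -> ids of documents indexed under it
    candidates = set()
    for doc_id in sorted(docs):
        words = sorted(docs[doc_id], key=lambda w: (freq[w], w))
        # Two documents with Jaccard >= threshold must share a prefix word.
        prefix_len = len(words) - math.ceil(threshold * len(words)) + 1
        for w in words[:prefix_len]:
            for other in index[w]:
                candidates.add((other, doc_id))
            index[w].append(doc_id)
    similar = {(i, j) for i, j in candidates
               if jaccard(docs[i], docs[j]) >= threshold}
    return similar, len(candidates)


# Hypothetical sample input: one similar pair among four short documents.
docs = {
    "d1": {"a", "b", "c", "d"},
    "d2": {"a", "b", "c", "e"},
    "d3": {"w", "x", "y", "z"},
    "d4": {"p", "q", "r", "s"},
}
naive_result, naive_count = naive_pairs(docs, 0.5)
fast_result, fast_count = prefiltered_pairs(docs, 0.5)
print(naive_result == fast_result, naive_count, fast_count)
```

On this sample both functions return the same single pair, but the naive version performs 6 comparisons while the prefiltered version verifies only 1 candidate. The gap grows with the number of documents as long as few of them share rare words.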