[The execution time](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Hadoop_IndexApproach.PNG) is `42 seconds`, much less than for the Naive Approach.
[The number of comparisons](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/counter_IndexApproach.PNG) is 17, far fewer than for the Naive Approach.
...
@@ -374,4 +375,14 @@ You can find the overview of hadoop below:
See the complete code [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/IndexApproach.java). I didn't commit the output since it's empty for the sample.
### Explain and justify the difference
Approach | Execution time | Number of comparisons
We can clearly see that the Index Approach is quicker than the Naive Approach, even on a sample dataset.
This is reasonable because the second method uses the inverted index to reduce the number of pair comparisons: it skips the (huge) number of comparisons between documents that cannot be similar.
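The first method, by contrast, compares every pair of documents: for n documents that is n(n-1)/2 comparisons, i.e. O(n²) time (and a correspondingly large amount of intermediate data), so it needs much more time.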
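
To make the difference concrete, here is a minimal sketch that counts the pairs each strategy would compare. It is not the code from `IndexApproach.java`: the class `ComparisonCountSketch` and both method names are hypothetical, and it assumes a simplified index in which two documents become a candidate pair as soon as they share any word.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only (hypothetical names, simplified indexing rule).
class ComparisonCountSketch {

    // Naive approach: every unordered pair of documents is compared once.
    static long naivePairs(int n) {
        return (long) n * (n - 1) / 2;
    }

    // Index approach (simplified assumption): only documents that share at
    // least one word in the inverted index become a candidate pair.
    static long indexedPairs(List<Set<String>> docs) {
        // Build the inverted index: word -> ids of the documents containing it.
        Map<String, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String word : docs.get(id)) {
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(id);
            }
        }
        // Collect distinct candidate pairs (i < j), each encoded as one long.
        Set<Long> candidates = new HashSet<>();
        for (List<Integer> ids : index.values()) {
            for (int a = 0; a < ids.size(); a++) {
                for (int b = a + 1; b < ids.size(); b++) {
                    candidates.add((long) ids.get(a) * docs.size() + ids.get(b));
                }
            }
        }
        return candidates.size();
    }
}
```

For four documents `{a, b}`, `{b, c}`, `{d}` and `{e}`, `naivePairs(4)` returns 6, while `indexedPairs` finds a single candidate pair (the two documents sharing `b`). The gap widens as the number of documents grows and most of them share no words.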