@@ -249,6 +249,7 @@ To avoid redundant parses of the input file, some intuition is needed. In this a
* In the reduce phase, process only keys that have exactly `two` instances. This lets us ignore empty documents: since empty documents do not appear in the input file, their keys occur only once. Because empty documents are rare, this barely affects the computation time.
* The two instances are exactly the two documents that need to be compared for that key. Compute their similarity and emit the pairs that are similar (a sketch of this reducer logic follows the list).
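A minimal sketch of that reducer logic is shown below. It assumes the key identifies a document pair, each value carries one document's text, and similarity is an illustrative word-set Jaccard score compared against an arbitrary threshold; the class name, threshold, and measure are assumptions, and the actual code lives in `NaiveApproach.java`.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical names; the actual implementation is in NaiveApproach.java.
public class NaiveSimilarityReducer extends Reducer<Text, Text, Text, DoubleWritable> {

    private static final double SIMILARITY_THRESHOLD = 0.8; // illustrative value

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> docs = new ArrayList<>();
        for (Text value : values) {
            docs.add(value.toString());
        }
        // Keys seen only once involve an empty document (absent from the input file): skip them.
        if (docs.size() != 2) {
            return;
        }
        double similarity = jaccard(docs.get(0), docs.get(1));
        if (similarity >= SIMILARITY_THRESHOLD) {
            context.write(key, new DoubleWritable(similarity));
        }
    }

    // Jaccard similarity over the word sets of two documents (an illustrative measure).
    private static double jaccard(String a, String b) {
        Set<String> wordsA = new HashSet<>(Arrays.asList(a.split("\\s+")));
        Set<String> wordsB = new HashSet<>(Arrays.asList(b.split("\\s+")));
        Set<String> union = new HashSet<>(wordsA);
        union.addAll(wordsB);
        wordsA.retainAll(wordsB); // wordsA now holds the intersection
        return union.isEmpty() ? 0.0 : (double) wordsA.size() / union.size();
    }
}
```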
The following code can be found [here](src/similarity/NaiveApproach.java).
**Mapper:**
```java
@Override
...
...
@@ -344,6 +345,7 @@ In this part, the implementation is more trivial:
* Since the map phase outputs only document ids rather than the documents themselves, a hashmap for document retrieval is needed at the reduce phase. We load it in the `setup` function.
* At the reduce phase, for each key, compute the similarity whenever several document ids are present. Since the words are sorted by frequency, far fewer comparisons should ideally be needed (see the sketch after this list).
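A hedged sketch of that reducer is given below. The configuration key `documents.path`, the tab-separated `id<TAB>text` file format, the class name, and the Jaccard measure with its threshold are all assumptions made for illustration; the actual corpus loading and similarity computation are in `PrefilteringApproach.java`.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical names; see PrefilteringApproach.java for the real implementation.
public class PrefilteringSimilarityReducer extends Reducer<Text, Text, Text, DoubleWritable> {

    private static final double SIMILARITY_THRESHOLD = 0.8;      // illustrative value
    private final Map<String, String> corpus = new HashMap<>();  // document id -> document text

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the id -> document map once per reducer; "documents.path" and the
        // "id<TAB>text" line format are assumptions made for this sketch.
        String docsPath = context.getConfiguration().get("documents.path");
        try (BufferedReader reader = new BufferedReader(new FileReader(docsPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    corpus.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // The key is a (rare) word; the values are the ids of the documents containing it.
        List<String> ids = new ArrayList<>();
        for (Text value : values) {
            ids.add(value.toString());
        }
        // Only keys shared by several documents yield candidate pairs to compare.
        for (int i = 0; i < ids.size(); i++) {
            for (int j = i + 1; j < ids.size(); j++) {
                String docA = corpus.get(ids.get(i));
                String docB = corpus.get(ids.get(j));
                if (docA == null || docB == null) {
                    continue;
                }
                double similarity = jaccard(docA, docB);
                if (similarity >= SIMILARITY_THRESHOLD) {
                    context.write(new Text(ids.get(i) + "," + ids.get(j)), new DoubleWritable(similarity));
                }
            }
        }
    }

    // Same illustrative word-set Jaccard measure as in the naive sketch above.
    private static double jaccard(String a, String b) {
        Set<String> wordsA = new HashSet<>(Arrays.asList(a.split("\\s+")));
        Set<String> wordsB = new HashSet<>(Arrays.asList(b.split("\\s+")));
        Set<String> union = new HashSet<>(wordsA);
        union.addAll(wordsB);
        wordsA.retainAll(wordsB);
        return union.isEmpty() ? 0.0 : (double) wordsA.size() / union.size();
    }
}
```

Note that in this sketch the same candidate pair may be compared under several shared words; the actual implementation may deduplicate such pairs.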
The following code can be found [here](src/similarity/PrefilteringApproach.java).