diff --git a/BDPA_Assign2_WJIN.md b/BDPA_Assign2_WJIN.md
index bd9c9c4c9238b2807f10762efea95120265309f8..8f69dfbb5dcd7f4c0fb75d8a3d3dd622aac15fcd 100644
--- a/BDPA_Assign2_WJIN.md
+++ b/BDPA_Assign2_WJIN.md
@@ -249,6 +249,7 @@ To avoid redundant parses of the input file, some intuition is needed. In this a
 * In the reduce phrase, process only keys with `two` instances. In this way we ignore empty documents because empty documents are not in input file, so they only appear once. Since empty documents are not often, computational time will not be too much affected.
 * The two instances are exactly the two documents that we need to compare for each key. Calculate similarity and emit key pairs that are similar.
 
+Following code can be found [here](src/similarity/NaiveApproach.java).
 **Mapper:**
 ```java
 @Override
@@ -344,6 +345,7 @@ In this part, the implementation is more trivial:
 * Since in the map phrase we didn't output the document corpus but only ids, a hashmap for document retrieval is needed at the reduce phase. We load it in `setup` function.
 * At reduce phase, for each key, compute similarity if severals document id are represented. Since the words are sorted by frequency, ideally there will be much less comparaison needed
 
+Following code can be found [here](src/similarity/PrefilteringApproach.java).
 **Mapper:**
 ```java
 @Override