Commit a9001215 authored by Wen Yao Jin

Update BDPA_Assign2_WJIN.md

parent 1b5fbbab
......@@ -249,6 +249,7 @@ To avoid redundant parses of the input file, some intuition is needed. In this a
* In the reduce phase, process only keys with exactly two instances. This skips empty documents: they are not in the input file, so any key involving one appears only once. Since empty documents are rare, the extra keys have little effect on computation time.
* The two instances are exactly the two documents to compare for each key. Calculate their similarity and emit the key pairs that are similar.
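
The reduce-side filtering described above might be sketched in plain Java, without the Hadoop API, as follows. This is only an illustration and not the code in `NaiveApproach.java`: the class name `NaiveReduceSketch`, the `"id1,id2"` key format, the threshold parameter, and the choice of Jaccard similarity over whitespace-separated tokens are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the reduce-side logic; not the author's reducer.
class NaiveReduceSketch {

    // Assumed similarity measure: Jaccard over whitespace-separated tokens.
    static double jaccard(String a, String b) {
        Set<String> tokensA = new HashSet<>(Arrays.asList(a.trim().split("\\s+")));
        Set<String> tokensB = new HashSet<>(Arrays.asList(b.trim().split("\\s+")));
        Set<String> intersection = new HashSet<>(tokensA);
        intersection.retainAll(tokensB);
        Set<String> union = new HashSet<>(tokensA);
        union.addAll(tokensB);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // grouped: pair key (assumed format "id1,id2") -> document texts received for that key.
    static List<String> similarPairs(Map<String, List<String>> grouped, double threshold) {
        List<String> similar = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            List<String> docs = entry.getValue();
            // Keys with a single instance involve a document missing from the input: skip them.
            if (docs.size() != 2) {
                continue;
            }
            if (jaccard(docs.get(0), docs.get(1)) >= threshold) {
                similar.add(entry.getKey());
            }
        }
        return similar;
    }
}
```

In an actual reducer each call receives a single key with its values, so the loop over `grouped` here stands in for the framework invoking `reduce` once per key.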
The following code can be found [here](src/similarity/NaiveApproach.java).
**Mapper:**
```java
@Override
......@@ -344,6 +345,7 @@ In this part, the implementation is more trivial:
* Since the map phase outputs only document ids and not the document contents, a hashmap for document retrieval is needed in the reduce phase. We load it in the `setup` function.
* In the reduce phase, for each key, compute similarity if several document ids are present. Since the words are sorted by frequency, ideally far fewer comparisons are needed.
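
As a rough sketch of the two points above, the plain-Java snippet below looks up document contents by id and compares only documents that share a key. It is not the code in `PrefilteringApproach.java`: the class name `PrefilterReduceSketch`, the `"id1,id2"` output format, and passing the similarity measure and threshold as parameters are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.ToDoubleBiFunction;

// Hypothetical illustration of the reduce-side logic; not the author's reducer.
class PrefilterReduceSketch {

    // idsByWord: key word -> ids of documents that emitted it in the map phase.
    // docsById:  id -> document content (stands in for the hashmap loaded in setup).
    // sim:       similarity measure, supplied by the caller (assumption).
    static Set<String> similarPairs(Map<String, List<Integer>> idsByWord,
                                    Map<Integer, String> docsById,
                                    ToDoubleBiFunction<String, String> sim,
                                    double threshold) {
        Set<String> similar = new TreeSet<>();
        for (List<Integer> ids : idsByWord.values()) {
            // A key seen in only one document yields no candidate pair.
            if (ids.size() < 2) {
                continue;
            }
            for (int i = 0; i < ids.size(); i++) {
                for (int j = i + 1; j < ids.size(); j++) {
                    int a = Math.min(ids.get(i), ids.get(j));
                    int b = Math.max(ids.get(i), ids.get(j));
                    if (a == b) {
                        continue;
                    }
                    String pair = a + "," + b;
                    // Skip pairs already accepted under another shared word.
                    if (similar.contains(pair)) {
                        continue;
                    }
                    if (sim.applyAsDouble(docsById.get(a), docsById.get(b)) >= threshold) {
                        similar.add(pair);
                    }
                }
            }
        }
        return similar;
    }
}
```

The outer loop flattens what the framework does by calling `reduce` once per key; in a real job, removing duplicate pairs across keys would need separate handling.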
The following code can be found [here](src/similarity/PrefilteringApproach.java).
**Mapper:**
```java
@Override
......