Commit d23b39b4 authored by Meiqi Guo

Update Report.md

parent d0eeabda
@@ -83,6 +83,16 @@ You can see the output file [here](https://gitlab.my.ecp.fr/2014guom/BigDataProc
All the details are written in my code [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java).
## Set-similarity joins
> You are asked to efficiently identify all pairs of documents (d1, d2) that are similar (sim(d1, d2) >= t), given a similarity function sim and a similarity threshold t. Specifically, assume that:
> - each output line of the pre-processing job is a unique document (line number is the document id),
> - documents are represented as sets of words,
> - sim(d1, d2) = Jaccard(d1, d2) = |d1 ∩ d2| / |d1 ∪ d2|,
> - t = 0.8.
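
The similarity function in the assignment statement can be sketched in Java as follows. This is a minimal illustration, not the code from this repository: the class name `JaccardSketch`, the helper `isSimilar`, and the convention that two empty documents have similarity 1.0 are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not part of the assignment code.
class JaccardSketch {

    // Jaccard(d1, d2) = |d1 ∩ d2| / |d1 ∪ d2| for documents given as word sets.
    static double jaccard(Set<String> d1, Set<String> d2) {
        if (d1.isEmpty() && d2.isEmpty()) {
            return 1.0; // assumed convention: two empty documents are identical
        }
        Set<String> intersection = new HashSet<>(d1);
        intersection.retainAll(d2);
        // |d1 ∪ d2| = |d1| + |d2| - |d1 ∩ d2|
        int unionSize = d1.size() + d2.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // A pair is reported when sim(d1, d2) >= t.
    static boolean isSimilar(Set<String> d1, Set<String> d2, double t) {
        return jaccard(d1, d2) >= t;
    }

    public static void main(String[] args) {
        Set<String> d1 = new HashSet<>(Arrays.asList("a", "b", "c", "d", "e"));
        Set<String> d2 = new HashSet<>(Arrays.asList("a", "b", "c", "d"));
        // 4 shared words out of 5 distinct words in total
        System.out.println(jaccard(d1, d2));
        System.out.println(isSimilar(d1, d2, 0.8));
    }
}
```

With t = 0.8, two documents of five and four words must share all four words of the shorter one to qualify, which shows how strict the threshold is for short lines.
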
For this part, I can't run directly on 'pg100.txt', which has 124787 lines, because finding similar documents would take too much time. So I made a sample from the first ten sonnets, 174 lines in total.
You can find the sample text [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/pg100_Sample.txt) and the output file after preprocessing [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess_sample).
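
To give a rough sense of why the full file is too expensive, the sketch below counts the pairs an all-pairs comparison would have to examine. It assumes every input line survives preprocessing as one document, so the figures are upper bounds; the class and method names are hypothetical.

```java
// Hypothetical helper, not part of the assignment code.
class PairCountSketch {

    // Number of unordered pairs among n documents: n * (n - 1) / 2.
    static long candidatePairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        // Assumption: each line of the input is one document.
        System.out.println(candidatePairs(124787L)); // 7785835291 pairs for the full text
        System.out.println(candidatePairs(174L));    // 15051 pairs for the sample
    }
}
```

The full text would need billions of comparisons, while the sample needs about fifteen thousand.
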
### Naive Approach
For this part, I can't use directly 'pg100.txt' with 12