Commit de504fc1 authored by Meiqi Guo

Update Report.md

parent d23b39b4
@@ -84,10 +84,10 @@ All the details are written in my code [Preprocess.java](https://gitlab.my.ecp.f
## Set-similarity joins
> You are asked to efficiently identify all pairs of documents (d1, d2) that are similar (sim(d1, d2) >= t), given a similarity function sim and a similarity threshold t. Specifically, assume that:
* each output line of the pre-processing job is a unique document (line number is the document id),
* documents are represented as sets of words,
* sim(d1, d2) = Jaccard(d1, d2) = |d1 ∩ d2| / |d1 ∪ d2|,
* t = 0.8.
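
To make the similarity test concrete, here is a minimal sketch of the Jaccard computation on two documents represented as word sets. It is an illustration only, not the code used in this report: the class `JaccardSketch`, its method names, and the assumption that words in a preprocessed line are separated by whitespace are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not part of the report's code base.
final class JaccardSketch {
    // Similarity threshold t from the assignment statement.
    static final double THRESHOLD = 0.8;

    // Turns one preprocessed line into a set of words.
    // Assumption: words are separated by whitespace.
    static Set<String> toWordSet(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Jaccard(d1, d2) = |d1 ∩ d2| / |d1 ∪ d2|.
    static double jaccard(Set<String> d1, Set<String> d2) {
        if (d1.isEmpty() && d2.isEmpty()) {
            return 0.0; // 0/0 is undefined; returning 0 is a convention chosen here
        }
        // Iterate over the smaller set and probe the larger one.
        Set<String> smaller = d1.size() <= d2.size() ? d1 : d2;
        Set<String> larger = (smaller == d1) ? d2 : d1;
        int intersection = 0;
        for (String word : smaller) {
            if (larger.contains(word)) {
                intersection++;
            }
        }
        // |d1 ∪ d2| = |d1| + |d2| - |d1 ∩ d2|
        int union = d1.size() + d2.size() - intersection;
        return (double) intersection / union;
    }

    // True when the pair passes the threshold sim(d1, d2) >= t.
    static boolean isSimilar(Set<String> d1, Set<String> d2) {
        return jaccard(d1, d2) >= THRESHOLD;
    }
}
```

For example, two documents with the words `a b c d e` and `a b c d` share 4 words out of 5 distinct ones, so their similarity is 4/5 = 0.8 and the pair is just similar enough to be reported.
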
For this part, I could not use 'pg100.txt' directly: with 124787 lines, finding similar documents would take too much time. So I made a sample from the first ten sonnets, 174 lines in total.
You can find the sample text [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/pg100_Sample.txt) and the output file after preprocessing [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess_sample).
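
The cost difference between the full file and the sample can be estimated by counting candidate pairs. The sketch below assumes a naive approach that compares every pair of documents, and it treats every input line as a document, so the figures are upper bounds (preprocessing may drop some lines). The class name `PairCountSketch` is hypothetical.

```java
// Hypothetical helper class, not part of the report's code base.
final class PairCountSketch {
    // Number of unordered pairs among n documents: n * (n - 1) / 2.
    // long is used because the full-file count exceeds the int range.
    static long pairCount(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(pairCount(174L));    // sample: 15051 pairs
        System.out.println(pairCount(124787L)); // full file: 7785835291 pairs
    }
}
```

Under these assumptions the sample needs about fifteen thousand comparisons, while the full file would need close to 7.8 billion.
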