diff --git a/Report.md b/Report.md
index 7396223959751ef81bb92ea601364877abdc831d..db12d8d209790f81e06b5e286104c3a12192d770 100644
--- a/Report.md
+++ b/Report.md
@@ -83,11 +83,11 @@ You can see the output file [here](https://gitlab.my.ecp.fr/2014guom/BigDataProc
 All the details are written in my code [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java).
 
 ## Set-similarity joins
-> You are asked to efficiently identify all pairs of documents (d1, d2) that are similar (sim(d1, d2) >= t), given a similarity function sim and a similarity threshold t. Specifically, assume that: - each output line of the pre-processing job is a unique document (line number is the document id), - documents are represented as sets of words, - sim(d1, d2) = Jaccard(d1, d2) = |d1 Ո d2| / |d1 U d2|, - t = 0.8.
+> You are asked to efficiently identify all pairs of documents (d1, d2) that are similar (sim(d1, d2) >= t), given a similarity function sim and a similarity threshold t. Specifically, assume that:
+* each output line of the pre-processing job is a unique document (line number is the document id),
+* documents are represented as sets of words,
+* sim(d1, d2) = Jaccard(d1, d2) = |d1 Ո d2| / |d1 U d2|,
+* t = 0.8.
 
 For this part, I can't directly use 'pg100.txt', with its 124787 lines, because finding similar documents would take too much time. So I made a sample of the first ten sonnets, 174 lines in total. You can find the sample text [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/pg100_Sample.txt) and the output file after preprocessing [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess_sample).
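
The similarity measure named in the diff above, Jaccard(d1, d2) = |d1 ∩ d2| / |d1 ∪ d2|, can be sketched in plain Java. This is not the report's MapReduce implementation — just an illustrative helper (class and method names are my own) showing how the set-based similarity and the threshold t = 0.8 would be checked for one pair of documents:

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardExample {

    // Jaccard(d1, d2) = |d1 ∩ d2| / |d1 ∪ d2|
    static double jaccard(Set<String> d1, Set<String> d2) {
        Set<String> intersection = new HashSet<>(d1);
        intersection.retainAll(d2);          // d1 ∩ d2
        Set<String> union = new HashSet<>(d1);
        union.addAll(d2);                    // d1 ∪ d2
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Two toy documents represented as sets of words, as in the assignment.
        Set<String> d1 = new HashSet<>(Set.of("shall", "compare", "thee", "summer", "day"));
        Set<String> d2 = new HashSet<>(Set.of("shall", "compare", "thee", "summer", "night"));

        double sim = jaccard(d1, d2);        // 4 shared words over 6 distinct words ≈ 0.667
        System.out.println("sim = " + sim + ", similar pair: " + (sim >= 0.8));
    }
}
```

Comparing every pair this way is quadratic in the number of documents, which is exactly why the full 124787-line `pg100.txt` is impractical and a 174-line sample is used instead.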