## Set-similarity joins
> You are asked to efficiently identify all pairs of documents (d1, d2) that are similar (sim(d1, d2) >= t), given a similarity function sim and a similarity threshold t. Specifically, assume that:
* each output line of the pre-processing job is a unique document (line number is the document id),
For this part, I can't directly use 'pg100.txt' (124787 lines), because finding similar documents on the whole file would take too much time. So I made a sample from the first ten sonnets, 174 lines in total.
You can find the sample text [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/pg100_Sample.txt) and the output file after preprocessing [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess_sample).
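To illustrate why the full file is costly, here is a minimal sketch of a naive all-pairs comparison. It assumes Jaccard similarity as `sim` and an in-memory list of word sets, neither of which is stated in this section; the class `NaiveSimilarityJoin` and its method names are hypothetical and not taken from the repository.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the assignment code.
final class NaiveSimilarityJoin {

    // Jaccard similarity |A ∩ B| / |A ∪ B|, assumed here as one possible choice of sim.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // convention: two empty sets are treated as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Number of unordered pairs among n documents: n * (n - 1) / 2.
    static long pairCount(long n) {
        return n * (n - 1) / 2;
    }

    // Compares every pair (i, j) with i < j and keeps those with sim >= t.
    // Document ids are 1-based line numbers, matching "line number is the document id".
    static List<int[]> similarPairs(List<Set<String>> docs, double t) {
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            for (int j = i + 1; j < docs.size(); j++) {
                if (jaccard(docs.get(i), docs.get(j)) >= t) {
                    result.add(new int[] {i + 1, j + 1});
                }
            }
        }
        return result;
    }
}
```

Under these assumptions, and taking the raw line counts as an upper bound on the number of documents, `pairCount(174)` gives 15051 comparisons for the sample, while `pairCount(124787)` gives 7785835291 for the full file.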