From d23b39b48d431a419561d688250657ea91038505 Mon Sep 17 00:00:00 2001
From: Meiqi Guo <mei-qi.guo@student.ecp.fr>
Date: Sat, 18 Mar 2017 09:53:13 +0100
Subject: [PATCH] Update Report.md

---
 Report.md | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/Report.md b/Report.md
index 58da43f..7396223 100644
--- a/Report.md
+++ b/Report.md
@@ -83,6 +83,16 @@ You can see the output file [here](https://gitlab.my.ecp.fr/2014guom/BigDataProc
 All the details are written in my code [Preprocess.java](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/Preprocess.java).
 
 ## Set-similarity joins 
+> You are asked to efficiently identify all pairs of documents (d1, d2) that are similar (sim(d1, d2) >= t), given a similarity function sim and a similarity threshold t. Specifically, assume that:
+ each output line of the pre-processing job is a unique document (line number is the document id),
+ documents are represented as sets of words,
+ sim(d1, d2) = Jaccard(d1, d2) = |d1 ∩ d2| / |d1 ∪ d2|,
+ t = 0.8.
+
+For this part, I cannot use 'pg100.txt' directly: with 124787 lines, finding similar documents would take too much time. So I made a sample from the first ten sonnets, 174 lines in total.
+You can find the sample text [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/pg100_Sample.txt) and the output file after preprocessing [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess_sample).
+
+### Naive Approach
+
 
-For this part, I can't use directly 'pg100.txt' with 12
 
-- 
GitLab