From c6ee14f733691ad8087ca6fbd3bcc44292b0f2e2 Mon Sep 17 00:00:00 2001
From: Meiqi Guo <mei-qi.guo@student.ecp.fr>
Date: Sat, 18 Mar 2017 10:12:53 +0100
Subject: [PATCH] Update Report.md

---
 Report.md | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 117 insertions(+)

diff --git a/Report.md b/Report.md
index db12d8d..6355013 100644
--- a/Report.md
+++ b/Report.md
@@ -93,6 +93,123 @@ For this part, I can't use directly 'pg100.txt' with 124787 lines because it wil
 You can find the sample text [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/input/pg100_Sample.txt) and the output file after preprocessing [here](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess_sample).
 
 ### Naive Approach
+**STEP 1: Prepare key pairs and make sure that the same pair of documents is compared no more than once**
+
+The map method should emit, for each document, a key made of the document's id and one other document's id (one such pair for every other document in the corpus), with the document's content as the value.
+For each pair to be compared only once, the shuffle and sort phase of the job must treat the keys (doc1, doc2) and (doc2, doc1) as identical. This can be done by implementing our own WritableComparable class for the pair keys, with a compareTo function adapted to our case. The class is called TextPair and lets us use tuple keys containing two Text objects:
+```
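+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.WritableComparable;
+
+class TextPair implements WritableComparable<TextPair> {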
+
+	private Text first;
+	private Text second;
+
+	public TextPair(Text first, Text second) {
+		set(first, second);
+	}
+
+	public TextPair() {
+		set(new Text(), new Text());
+	}
+
+	public TextPair(String first, String second) {
+		set(new Text(first), new Text(second));
+	}
+
+	public Text getFirst() {
+		return first;
+	}
+
+	public Text getSecond() {
+		return second;
+	}
+
+	public void set(Text first, Text second) {
+		this.first = first;
+		this.second = second;
+	}
+
+	@Override
+	public void readFields(DataInput in) throws IOException {
+		first.readFields(in);
+		second.readFields(in);
+	}
+
+	@Override
+	public void write(DataOutput out) throws IOException {
+		first.write(out);
+		second.write(out);
+	}
+
+	@Override
+	public String toString() {
+		return first + " " + second;
+	}
+
+	@Override
+	public int compareTo(TextPair other) {
+		int cmpFirstFirst = first.compareTo(other.first);
+		int cmpSecondSecond = second.compareTo(other.second);
+		int cmpFirstSecond = first.compareTo(other.second);
+		int cmpSecondFirst = second.compareTo(other.first);
+
+		if (cmpFirstFirst == 0 && cmpSecondSecond == 0 || cmpFirstSecond == 0
+				&& cmpSecondFirst == 0) {
+			return 0;
+		}
+
+		Text thisSmaller;
+		Text otherSmaller;
+
+		Text thisBigger;
+		Text otherBigger;
+
+		if (this.first.compareTo(this.second) < 0) {
+			thisSmaller = this.first;
+			thisBigger = this.second;
+		} else {
+			thisSmaller = this.second;
+			thisBigger = this.first;
+		}
+
+		if (other.first.compareTo(other.second) < 0) {
+			otherSmaller = other.first;
+			otherBigger = other.second;
+		} else {
+			otherSmaller = other.second;
+			otherBigger = other.first;
+		}
+
+		int cmpThisSmallerOtherSmaller = thisSmaller.compareTo(otherSmaller);
+		int cmpThisBiggerOtherBigger = thisBigger.compareTo(otherBigger);
+
+		if (cmpThisSmallerOtherSmaller == 0) {
+			return cmpThisBiggerOtherBigger;
+		} else {
+			return cmpThisSmallerOtherSmaller;
+		}
+	}
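+
+	// equals and hashCode ignore the order of the two ids, consistently with compareTo,
+	// so that the default hash partitioner sends (a, b) and (b, a) to the same reducer.
+	@Override
+	public boolean equals(Object o) {
+		if (o instanceof TextPair) {
+			TextPair tp = (TextPair) o;
+			return (first.equals(tp.first) && second.equals(tp.second))
+					|| (first.equals(tp.second) && second.equals(tp.first));
+		}
+		return false;
+	}
+
+	@Override
+	public int hashCode() {
+		return first.hashCode() + second.hashCode();
+	}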
+
+}
+```
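+
+The order-insensitive key contract can also be illustrated without Hadoop. The following is only a sketch under assumptions: `UnorderedPair` is a hypothetical plain-Java stand-in for TextPair that uses String instead of Text and sorts the two ids once in the constructor, which is a different design from the compareTo above, not the code used in the job.
+
+```java
+// Hypothetical stand-in for TextPair: Strings replace Hadoop Text so that
+// the example runs without Hadoop on the classpath.
+final class UnorderedPair implements Comparable<UnorderedPair> {
+	private final String smaller;
+	private final String bigger;
+
+	UnorderedPair(String a, String b) {
+		// Store the ids in sorted order so that (a, b) and (b, a) become identical.
+		if (a.compareTo(b) <= 0) {
+			smaller = a;
+			bigger = b;
+		} else {
+			smaller = b;
+			bigger = a;
+		}
+	}
+
+	@Override
+	public int compareTo(UnorderedPair other) {
+		// Compare the smaller ids first, then the bigger ones.
+		int cmp = smaller.compareTo(other.smaller);
+		return cmp != 0 ? cmp : bigger.compareTo(other.bigger);
+	}
+
+	@Override
+	public boolean equals(Object o) {
+		if (!(o instanceof UnorderedPair)) {
+			return false;
+		}
+		UnorderedPair p = (UnorderedPair) o;
+		return smaller.equals(p.smaller) && bigger.equals(p.bigger);
+	}
+
+	@Override
+	public int hashCode() {
+		return 31 * smaller.hashCode() + bigger.hashCode();
+	}
+
+	@Override
+	public String toString() {
+		return smaller + " " + bigger;
+	}
+}
+```
+
+With this sketch, `new UnorderedPair("1", "2")` and `new UnorderedPair("2", "1")` compare as equal and share a hash code. Like Text, the ids are compared lexicographically, not numerically.
+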
+**STEP 2: Map phase**
+
+To avoid redundant passes over the input file, the algorithm parses it only once:
+* Load all document ids in advance (in our case we use the total number of lines n, obtained from the line counter during preprocessing);
+* Use [output_preprocess_sample](https://gitlab.my.ecp.fr/2014guom/BigDataProcessAssignment2/blob/master/output/output_preprocess_sample) as input and, for each document in the input, emit the key (id, i) for every i in 1..n with i != id, where id is the document's id, with the document's content as the value. Note that some documents may be empty; they are handled in the reduce phase.
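+
+The key generation described above can be sketched in plain Java as follows. This is a minimal illustration, not the job's mapper: `NaivePairMapper` and `pairsFor` are hypothetical names, and each returned pair stands for a TextPair key that the real mapper would emit together with the document's content.
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+
+final class NaivePairMapper {
+	// For the document `id`, build one key (id, i) for every other id i in 1..n.
+	// In the Hadoop job each key would be a TextPair and the value the document's content.
+	static List<String[]> pairsFor(int id, int n) {
+		List<String[]> keys = new ArrayList<>();
+		for (int i = 1; i <= n; i++) {
+			if (i != id) {
+				keys.add(new String[] {String.valueOf(id), String.valueOf(i)});
+			}
+		}
+		return keys;
+	}
+}
+```
+
+Each document therefore produces n - 1 keys, so the map output grows quadratically with the number of documents.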
+
+
+
+
+In the reduce phase, process only the keys that received two values. This skips pairs involving an empty document: empty documents are not in the input file, so any key containing the id of one receives a single value. Since empty documents are rare, the computation time is not much affected.
+The two values are exactly the two documents to compare for that key. Compute their similarity and emit the key pairs that are similar.
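+
+The reduce-side filter can be sketched as follows. The text above does not name the similarity measure or the threshold, so this sketch assumes Jaccard similarity over whitespace-separated tokens and takes the threshold as a parameter; `NaiveSimilarityReducer`, `jaccard` and `isSimilar` are hypothetical names.
+
+```java
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+final class NaiveSimilarityReducer {
+	// Assumed measure: Jaccard similarity over whitespace-separated tokens.
+	static double jaccard(String d1, String d2) {
+		Set<String> a = new HashSet<>(Arrays.asList(d1.trim().split("\\s+")));
+		Set<String> b = new HashSet<>(Arrays.asList(d2.trim().split("\\s+")));
+		Set<String> intersection = new HashSet<>(a);
+		intersection.retainAll(b);
+		Set<String> union = new HashSet<>(a);
+		union.addAll(b);
+		return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
+	}
+
+	// Keys with fewer than two values involve an empty document and are skipped.
+	static boolean isSimilar(List<String> values, double threshold) {
+		if (values.size() != 2) {
+			return false;
+		}
+		return jaccard(values.get(0), values.get(1)) >= threshold;
+	}
+}
+```
+
+In the Hadoop job, `values` would correspond to the contents collected for one TextPair key, and the key would be emitted only when `isSimilar` returns true.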
 
 
 
-- 
GitLab