# Assignment 2 for BDPA

by Wenyao JIN

## Preprocessing the input

### 1. Remake the stopwords file

    By slightly modifying the wordcount code from the previous assignment, we can output a stopwords file.

- take all three input files as before
- use whitespace or a sequence of two or more dashes (`--`) as the token delimiter
- strip all characters other than letters and digits
- transform all words to lower case
- output only words whose count is larger than 4000 (a reducer sketch follows the mapper code below)
    
    // Mapper: tokenize, strip non-alphanumeric characters, lower-case,
    // and emit (word, 1) for the reducer to aggregate.
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+|-{2,}+")) {
            word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase());
            context.write(word, ONE);
        }
    }
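The reducer side is the standard wordcount sum plus the threshold. A minimal sketch, assuming the usual `Text`/`IntWritable` wordcount types (the class name `StopWordReducer` is ours, not taken from the repository):

    // Sum the counts per word and keep only words occurring more than
    // 4000 times; the surviving words form the stop word list.
    public static class StopWordReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            if (sum > 4000) {
                context.write(key, new IntWritable(sum));
            }
        }
    }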
          

    The stop word file can be found here.

### 2. Count word frequency of pg100.txt

By running the wordcount algorithm again, we recount the word frequency of pg100.txt, to be used later for word sorting. This time, capitalization is kept so that it can be taken into account in the similarity comparison. The output file can be found here.

### 3. Output lines

In this step, several tasks need to be done:

- Store all stopwords in a set
- Store all word frequencies in a hashmap
- For each line:
  - track the line number with a counter
  - skip empty lines
  - split the line into words
  - filter out special characters
  - drop words that are stopwords
  - remove duplicates
  - sort the remaining words by their pre-calculated frequency
  - output the words, with their line number as key

For this step, all tasks are done within the mapper. The tokenizer is whitespace or `--`, as before. A set container is used to avoid duplicates. Java's built-in sort is applied with a custom comparator based on the word frequencies. StringUtils's join function joins the words together with commas. The counter reveals a total of 124787 lines.

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Track the total number of lines with a global counter.
        Counter counter = context.getCounter(DocLineCounter.NUM);
        counter.increment(1);
        // Skip empty lines.
        if (value.toString().isEmpty()) {
            return;
        }
        // A set removes duplicate words automatically.
        Set<String> wordSet = new HashSet<String>();
        for (String token : value.toString().split("\\s+|-{2,}+")) {
            String s = token.replaceAll("[^A-Za-z0-9]+", "");
            // Drop stopwords and empty tokens.
            if (stopWords.contains(s) || s.isEmpty()) {
                continue;
            } else if (!wordFreq.containsKey(s)) {
                System.out.println("WARN: hashtable does not have word:");
                System.out.println(s);
            }
            wordSet.add(s);
        }
        // Sort the words by their pre-computed corpus frequency.
        List<String> wordList = new ArrayList<String>(wordSet);
        Collections.sort(wordList, new Comparator<String>() {
            @Override
            public int compare(String s1, String s2) {
                return wordFreq.get(s1).compareTo(wordFreq.get(s2));
            }
        });
        // Emit the line number as key and the comma-joined words as value.
        words.set(StringUtils.join(wordList, ","));
        context.write(new LongWritable(counter.getValue()), words);
    }
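The mapper above relies on `stopWords` and `wordFreq` being populated beforehand. A minimal `setup()` sketch, assuming both files are tab-separated `word TAB count` text (the default TextOutputFormat layout) and that their paths are passed through the job `Configuration` under property names of our choosing:

    // Fields assumed on the mapper class:
    //   Set<String> stopWords;  Map<String, Long> wordFreq;
    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        stopWords = new HashSet<String>();
        wordFreq = new HashMap<String, Long>();
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        // Load the stop word list produced in step 1 ("stopwords.path" is
        // an assumed property name).
        BufferedReader br = new BufferedReader(
                new InputStreamReader(fs.open(new Path(conf.get("stopwords.path")))));
        String line;
        while ((line = br.readLine()) != null) {
            stopWords.add(line.split("\t")[0]);
        }
        br.close();
        // Load the per-word frequencies produced in step 2.
        br = new BufferedReader(
                new InputStreamReader(fs.open(new Path(conf.get("wordfreq.path")))));
        while ((line = br.readLine()) != null) {
            String[] kv = line.split("\t");
            wordFreq.put(kv[0], Long.parseLong(kv[1]));
        }
        br.close();
    }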

The output file can be found here. The total line number is output to HDFS as instructed; you can also find the file here.


## Set similarity joins

### 0. Preliminary implementation

In this part, we need to compute pairwise similarities. Before implementing the two approaches, a few basic modules need to be built.

#### Key Pair

In this MapReduce program, the keys emitted by the mappers are pairs of document ids. Thus, a new key-pair class (in our case a pair of `LongWritable`s) is needed.

    Several remarks and intuition here:

- `LongPair` needs to implement the `WritableComparable` interface in order to be shuffled and sorted.
- Override `equals`: the order within a pair should not matter when testing equality (for example, (A,B) should equal (B,A)), so the method also checks the inverted pair.
- Override `compareTo`: the exact ordering does not matter much, but it must be consistent with `equals`. We compare pairs by a sum value (the sum of the two ids) and a difference value (the absolute difference of the two ids). Two pairs are equal if and only if both values coincide: from a sum s and a difference d, the two ids can be recovered as (s+d)/2 and (s-d)/2, so (sum, diff) uniquely identifies an unordered pair.
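As a quick sanity check of that argument: (3,5) and (5,3) both give sum 8 and difference 2, so they compare as equal, while (2,6) also has sum 8 but difference 4, so it is strictly ordered before or after them. The full class: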
    // A WritableComparable key holding an unordered pair of document ids.
    class LongPair implements WritableComparable<LongPair> {

        private LongWritable first;
        private LongWritable second;

        public LongPair() {
            this.set(new LongWritable(0), new LongWritable(0));
        }

        public LongPair(LongWritable first, LongWritable second) {
            this.set(first, second);
        }

        public LongPair(Long first, Long second) {
            this.set(new LongWritable(first), new LongWritable(second));
        }

        public LongPair(String first, String second) {
            this.set(new LongWritable(Long.parseLong(first)),
                     new LongWritable(Long.parseLong(second)));
        }

        public LongWritable getFirst() {
            return first;
        }

        public LongWritable getSecond() {
            return second;
        }

        public void set(LongWritable first, LongWritable second) {
            this.first = first;
            this.second = second;
        }

        public void setFirst(LongWritable first) {
            this.first = first;
        }

        public void setFirst(Long first) {
            this.first = new LongWritable(first);
        }

        public void setSecond(LongWritable second) {
            this.second = second;
        }

        public void setSecond(Long second) {
            this.second = new LongWritable(second);
        }

        // Sum and absolute difference of the two ids; together they
        // identify an unordered pair uniquely.
        public long getSum() {
            return this.first.get() + this.second.get();
        }

        public long getDiff() {
            return Math.abs(this.first.get() - this.second.get());
        }

        public LongPair inverse() {
            return new LongPair(second, first);
        }

        // Order-insensitive equality: (A,B) equals (B,A).
        @Override
        public boolean equals(Object o) {
            if (o instanceof LongPair) {
                LongPair p1 = (LongPair) o;
                boolean b1 = first.equals(p1.first) && second.equals(p1.second);
                LongPair p2 = p1.inverse();
                boolean b2 = first.equals(p2.first) && second.equals(p2.second);
                return b1 || b2;
            }
            return false;
        }

        // hashCode must be order-insensitive as well, so that (A,B) and
        // (B,A) are routed to the same reducer by the default HashPartitioner.
        @Override
        public int hashCode() {
            return (int) (31 * getSum() + getDiff());
        }

        // Compare by (sum, diff); consistent with equals, since both values
        // match exactly when two unordered pairs are equal.
        @Override
        public int compareTo(LongPair other) {
            long cmp = this.getSum() - other.getSum();
            long cmpAlter = this.getDiff() - other.getDiff();
            if (cmp < 0) {
                return 1;
            } else if (cmp > 0) {
                return -1;
            } else if (cmpAlter < 0) {
                return 1;
            } else if (cmpAlter > 0) {
                return -1;
            }
            return 0;
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            first.readFields(in);
            second.readFields(in);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            first.write(out);
            second.write(out);
        }

        @Override
        public String toString() {
            return first.toString() + "," + second.toString();
        }
    }
#### Similarity

To compute the similarity of two strings as instructed, we use a `Set` to store the words. The advantage of a set is that it automatically ignores duplicates, which makes the union and intersection operations straightforward.

    // Jaccard similarity of two comma-separated word lists.
    public double similarity(String t1, String t2) {

        Set<String> s1 = text2Set(t1);
        Set<String> s2 = text2Set(t2);

        Set<String> union = new HashSet<String>(s1);
        union.addAll(s2);

        Set<String> intersection = new HashSet<String>(s1);
        intersection.retainAll(s2);

        if (union.size() == 0) {
            return 0;
        }

        // Cast to double: plain integer division would always yield 0 or 1.
        return (double) intersection.size() / union.size();
    }
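The `text2Set` helper is not shown above; a minimal sketch, assuming the documents are stored as comma-separated word lists (as produced in the preprocessing step):

    // Split a comma-separated word list into a set of words.
    private Set<String> text2Set(String text) {
        return new HashSet<String>(Arrays.asList(text.split(",")));
    }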
#### File sampling

Our input file has more than 100000 documents. With pairwise comparison in the naive approach, more than 10^10 key-value pairs would be emitted (n = 124787 lines gives n(n-1) of roughly 1.56 x 10^10 emissions). This far exceeds the capacity of our virtual machine. In the following sections, we therefore only test the algorithms on sampled documents (the first 1000 documents). The files can be found here and here.

### 1. Naive approach

Instruction:

> Perform all pairwise comparisons between documents, using the following technique: each document is handled by a single mapper (remember that lines are used to represent documents in this assignment). The map method should emit, for each document, the document id along with one other document id as a key (one such pair for each other document in the corpus) and the document's content as a value. In the reduce phase, perform the Jaccard computations for all/some selected pairs. Output only similar pairs on HDFS, in TextOutputFormat.

To avoid redundant passes over the input file, some care is needed. In this algorithm, the input file is parsed only once.

- Load all document ids in advance (in our case we use the total line number n obtained from the previous counter, so the id set is simply 1..n; ids of empty documents are discarded later in the algorithm).
- For each (id, document) from the input, emit ((id, i), document) for every i != id in the id set (including ids of empty lines). Due to the symmetry of the key pair, most keys will receive two instances.
- In the reduce phase, process only keys with two instances. This way we ignore empty documents: they do not appear in the input file, so their keys receive only one instance. Since empty documents are rare, computation time is not much affected.
- The two instances of a key are exactly the two documents we need to compare. Compute the similarity and emit the key pairs that are similar.

The following code can be found here.

    Mapper:

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {

        // Skip empty documents.
        if (value.toString().isEmpty()) {
            return;
        }

        String keyOut = key.toString();

        // The input key is the line (document) id produced in preprocessing.
        if (!StringUtils.isNumeric(keyOut)) {
            System.out.println("WARN: bad input id");
            System.out.println(keyOut);
            return;
        }

        this.words = value;
        this.keyPair.setFirst(Long.parseLong(keyOut));

        // Emit (id, i) for every other document id i in 1..fileLength.
        long counter = 1;
        while (counter <= this.fileLength) {
            this.keyPair.setSecond(counter);

            // Skip the degenerate pair (id, id).
            if (this.keyPair.getDiff() == 0) {
                counter += 1;
                continue;
            }

            context.write(keyPair, words);
            counter += 1;
        }
    }

    Reducer:

    @Override
    public void reduce(LongPair key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Collect the (at most two) documents received for this pair.
        int counter = 0;
        String[] strings = new String[2];
        for (Text v : values) {
            strings[counter] = v.toString();
            counter += 1;
        }

        // Fewer than two instances: one id of the pair is not in the input
        // file (an empty line), so the pair is skipped.
        if (counter != 2) {
            return;
        }

        double s = similarity(strings[0], strings[1]);
        context.getCounter(CompCounter.NUM).increment(1);

        if (s >= 0.8) {
            context.write(new Text(key.toString()), new Text(String.valueOf(s)));
        }
    }

The number of mappers can be changed with this line:

          conf.set("mapreduce.input.fileinputformat.split.maxsize", String.valueOf(500));
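For context, this setting lives in the job driver. A minimal driver sketch; the class names `NaiveApproach`, `NaiveMapper`, and `NaiveReducer` are ours, and `KeyValueTextInputFormat` is an assumption suggested by the mapper's `(Text, Text)` input signature (it splits each line on the tab between line number and words):

    // Driver sketch: a small split size forces many mappers.
    Configuration conf = new Configuration();
    conf.set("mapreduce.input.fileinputformat.split.maxsize", String.valueOf(500));

    Job job = Job.getInstance(conf, "naive similarity join");
    job.setJarByClass(NaiveApproach.class);
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    job.setMapperClass(NaiveMapper.class);
    job.setReducerClass(NaiveReducer.class);

    job.setMapOutputKeyClass(LongPair.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);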

The Hadoop job overview:

    Execution time: 7m 50s

365085 similarities are computed. Knowing that our sampled file contains n = 855 non-empty documents, we indeed find 365085 = n(n-1)/2, so the algorithm worked as expected.

### 2. Pre-filtering approach

Instruction:

> Create an inverted index, only for the first |d| - ⌈t|d|⌉ + 1 words of each document d (remember that they are stored in ascending order of frequency). In your reducer, compute the similarity of the document pairs. Output only similar pairs on HDFS, in TextOutputFormat. Report the execution time and the number of performed comparisons.

In this part, the implementation is more straightforward:

- In the map phase, invert id and words: for each word inside the window of the first |d| - ⌈t|d|⌉ + 1 words, output the word as key and the document id as value.
- Since the map phase outputs only ids and not the document contents, a hashmap for document retrieval is needed in the reduce phase. We load it in the setup function (a sketch follows below).
- In the reduce phase, for each key, compute similarities whenever several document ids are present. Since the words are sorted by ascending frequency, far fewer comparisons should be needed.
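A minimal sketch of that setup function, assuming the preprocessed `lineNumber TAB words` file path is passed through the `Configuration` (the property name `corpus.path` is ours):

    // Load the whole corpus into memory: document id -> word list.
    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        document = new HashMap<Text, Text>();
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        Path corpus = new Path(conf.get("corpus.path"));  // assumed property name
        BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(corpus)));
        String line;
        while ((line = br.readLine()) != null) {
            String[] kv = line.split("\t");
            if (kv.length == 2) {
                document.put(new Text(kv[0]), new Text(kv[1]));
            }
        }
        br.close();
    }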

The following code can be found here.

    Mapper:

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {

        // Skip empty documents.
        if (value.toString().isEmpty()) {
            return;
        }

        String[] document = value.toString().split(",");
        // Indexing window: the least frequent words, with t = 0.8.
        int window = document.length - (int) Math.floor(document.length * 0.8);
        int counter = 0;

        // Emit (word, documentId) for each word in the window.
        while (counter < window) {
            word.set(document[counter]);
            context.write(word, key);
            counter += 1;
        }
    }
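As a concrete check of the window size: for a document of |d| = 11 words and t = 0.8, the instruction gives 11 - ⌈8.8⌉ + 1 = 3 and the code gives 11 - ⌊8.8⌋ = 3, which agree; for |d| = 10, however, the instruction gives 10 - 8 + 1 = 3 while the code gives 10 - 8 = 2, so the code indexes one word fewer whenever t|d| is an integer.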

    Reducer:

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Materialize the ids sharing this indexed word; the iterable can
        // only be traversed once, so the values are copied.
        List<Text> val = new ArrayList<Text>();
        for (Text v : values) {
            val.add(new Text(v));
        }
        // Compare every pair of documents sharing this word. Note that each
        // unordered pair is visited twice here (once in each order).
        for (Text v1 : val) {
            for (Text v2 : val) {
                if (v1.equals(v2)) {
                    continue;
                }

                // Retrieve the full documents from the in-memory corpus.
                String s1 = this.document.get(v1).toString();
                String s2 = this.document.get(v2).toString();

                context.getCounter(CompCounter.NUM).increment(1);
                Double s = similarity(s1, s2);

                if (s >= 0.8) {
                    context.write(new Text(v1.toString() + ',' + v2.toString()),
                                  new Text(String.valueOf(s)));
                }
            }
        }
    }

The Hadoop job overview:

    Execution time: 15 s

976 comparisons are made in this job, far fewer than in the naive approach.

### 3. Justification of the difference

The output of similar documents can be found here. Remember that we used a sampled file, so there are far fewer similar docs than there would be on the full input. Even so, we can see that similar docs are very rare compared to the length of the sampled input.

| Job                  | # of comparisons | Execution time |
|----------------------|------------------|----------------|
| NaiveApproach        | 365085           | 7m 50s         |
| PrefilteringApproach | 976              | 15s            |

The naive approach performs O(n^2) comparisons and emits O(n^2) key-value pairs, so it needs far more time and memory, even in the shuffle and sort phase.

The pre-filtering approach is very efficient when similar documents are rare and documents are not very long, which is exactly our case. This explains the drastic performance difference.