  • Big Data Process Assignment 2 _ Meiqi GUO

    Pre-processing the input

    For the pre-processing part, the input consists of:

    I do the following tasks in Preprocess.java:

    STEP 1: Remove all stopwords

    else if (stopWords.contains(word)) {
        continue; // skip stopwords
    }

    STEP 2: Remove special characters (keep only [a-z],[A-Z] and [0-9]) and convert to lower case

    word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase());

    STEP 3: Keep each unique word only once per line

    We define a HashSet that stores the words of the current line, so that each word is kept only once:

    Set<String> wordSet = new HashSet<String>();
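Steps 1 to 3 can be tried outside Hadoop with a minimal sketch like the one below. It is not the code of Preprocess.java. The class name LineCleaner, the method cleanLine, the whitespace tokenization and the choice to clean a token before the stopword check are all assumptions made for illustration. A LinkedHashSet is used instead of a plain HashSet only so that the printed order is predictable.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Preprocess.java.
class LineCleaner {

    // Returns the distinct, cleaned, non-stopword tokens of one line.
    static Set<String> cleanLine(String line, Set<String> stopWords) {
        // LinkedHashSet drops duplicates and keeps first-seen order.
        Set<String> wordSet = new LinkedHashSet<>();
        for (String token : line.split("\\s+")) {
            // STEP 2: keep only [A-Za-z0-9] and convert to lower case.
            String word = token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
            if (word.isEmpty()) {
                continue; // the token contained only special characters
            }
            // STEP 1: skip stopwords (assumed to be stored in lower case).
            if (stopWords.contains(word)) {
                continue;
            }
            // STEP 3: the set keeps each word only once per line.
            wordSet.add(word);
        }
        return wordSet;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and"));
        System.out.println(cleanLine("The King, and THE king's Crown!", stopWords));
        // prints [king, kings, crown]
    }
}
```

Note that "king's" becomes "kings" because the apostrophe is deleted rather than treated as a separator. This follows directly from the regular expression used in STEP 2.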

    STEP 4: Remove empty lines

    First, I removed all lines that were already empty in the input:

    if (value.toString().isEmpty()) {
        return; // nothing to emit for an empty input line
    }

    After removing stopwords and special characters, I also removed the lines that had become empty:

    if (wordSet.isEmpty()) {
        return; // no token survived the cleaning
    }

    STEP 5: Count line numbers

    I used two counters:

    • LineNumCounter records the number of lines in the initial document;
    • FinalLineNumCounter records the number of lines in the output, that is, the number left after removing all empty lines.

    The results are shown below:

    NUM = 124787

    Final_NUM = 114815

    So 9972 lines (nearly 10000) were empty, either in the original input or after pre-processing.
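The two-counter idea can be illustrated with a small plain-Java sketch. It is a simplification under stated assumptions. The class LineCounters and its process method are hypothetical, and stopword removal is left out, so a line survives here as soon as it contains one letter or digit. In the actual job the tallies are more likely kept with Hadoop's counter mechanism, but the README does not show that code.

```java
// Hypothetical stand-in for the two counters, not the code of Preprocess.java.
class LineCounters {
    long lineNumCounter = 0;       // lines read from the initial document
    long finalLineNumCounter = 0;  // lines that are still non-empty after cleaning

    // Mimics one map() call on a single input line.
    void process(String line) {
        lineNumCounter++;                 // every input line is counted
        if (line.isEmpty()) {
            return;                       // empty in the input
        }
        String cleaned = line.replaceAll("[^A-Za-z0-9]+", "");
        if (cleaned.isEmpty()) {
            return;                       // became empty after cleaning
        }
        finalLineNumCounter++;            // the line reaches the output
    }

    public static void main(String[] args) {
        LineCounters counters = new LineCounters();
        for (String line : new String[] {"To be", "", "!!!", "or not"}) {
            counters.process(line);
        }
        System.out.println("NUM = " + counters.lineNumCounter);            // NUM = 4
        System.out.println("Final_NUM = " + counters.finalLineNumCounter); // Final_NUM = 2
        // The README's own totals give 124787 - 114815 = 9972 removed lines.
        System.out.println(124787 - 114815);
    }
}
```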

    STEP 6: Order the tokens of each line in ascending order of global frequency

    I used the word-frequency file of pg100.txt as input. I obtained it by running assignment 1 with a slightly modified MyWordCount. I then ordered the tokens of each line by their frequency.

    Collections.sort(wordList, new Comparator<String>() {
        @Override
        public int compare(String s1, String s2) {
            // ascending order of global frequency
            return wordFreq.get(s1).compareTo(wordFreq.get(s2));
        }
    });
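The ordering step can be sketched as a self-contained example. Several details are assumptions, not facts from Preprocess.java: the frequency file is taken to have one "word<TAB>count" pair per line, the names FrequencySorter, loadFrequencies and sortByFrequency are hypothetical, words missing from the frequency map are given frequency 0, and ties are broken alphabetically so that the result is deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Preprocess.java.
class FrequencySorter {

    // Parses lines of the assumed form "word<TAB>count" into a lookup map.
    static Map<String, Integer> loadFrequencies(List<String> lines) {
        Map<String, Integer> wordFreq = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length == 2) {
                wordFreq.put(parts[0], Integer.parseInt(parts[1].trim()));
            }
        }
        return wordFreq;
    }

    // Returns the words in ascending order of global frequency.
    static List<String> sortByFrequency(Collection<String> words,
                                        Map<String, Integer> wordFreq) {
        List<String> wordList = new ArrayList<>(words);
        wordList.sort((s1, s2) -> {
            // Unknown words count as frequency 0 instead of causing a NullPointerException.
            int byFreq = Integer.compare(wordFreq.getOrDefault(s1, 0),
                                         wordFreq.getOrDefault(s2, 0));
            // Alphabetical tie-break (an assumption) keeps the order stable across runs.
            return byFreq != 0 ? byFreq : s1.compareTo(s2);
        });
        return wordList;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordFreq =
                loadFrequencies(Arrays.asList("king\t120", "crown\t15", "sword\t15"));
        System.out.println(
                sortByFrequency(Arrays.asList("king", "sword", "zounds", "crown"), wordFreq));
        // prints [zounds, crown, sword, king]
    }
}
```

The snippet above calls wordFreq.get(...) directly, which throws a NullPointerException for any token that is absent from the frequency file. Using getOrDefault is one way to guard against that case.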

    You can see the output file here.

    All the details are written in my code Preprocess.java.