Association.js
Big Data Process Assignment 2 _ Meiqi GUO
Pre-processing the input
For the pre-processing part, the input consists of:
- the document corpus pg100.txt;
- the stopword file that I built in Assignment 1;
- the word-frequency file for pg100.txt, obtained by running Assignment 1 with a slight modification of MyWordCount.java.
I do the following tasks in Preprocess.java:
STEP 1: Remove all stopwords
if (stopWords.contains(word)){
    continue;
}
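The stopword check above can be sketched as a standalone filter. This is a minimal illustration, not the Hadoop mapper itself: the `STOP_WORDS` set here is a toy example, whereas the real job loads my stopword file from Assignment 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordFilter {
    // Toy stopword set; in the real job this is loaded from my stopword file.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "and", "of", "to"));

    // Keep only the tokens that are not stopwords.
    static List<String> removeStopwords(String[] tokens) {
        List<String> kept = new ArrayList<>();
        for (String word : tokens) {
            if (STOP_WORDS.contains(word)) {
                continue; // same check as in the mapper
            }
            kept.add(word);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopwords("the quality of mercy".split(" ")));
    }
}
```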
STEP 2: Remove special characters (keep only [a-z],[A-Z] and [0-9]) and convert to lower case
word.set(token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase());
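The cleaning call can be isolated into a small helper to show its effect on a few tokens; this mirrors the `replaceAll(...).toLowerCase()` expression above, outside of the Hadoop `Text` wrapper.

```java
public class TokenCleaner {
    // Strip every character outside [A-Za-z0-9], then lower-case,
    // mirroring the replaceAll(...).toLowerCase() call in the mapper.
    static String clean(String token) {
        return token.replaceAll("[^A-Za-z0-9]+", "").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(clean("Shakespeare's")); // shakespeares
        System.out.println(clean("KING."));         // king
    }
}
```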
STEP 3: Keep each unique word only once per line
I define a HashSet in which the words of the current line are stored:
Set<String> wordSet = new HashSet<String>();
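The deduplication can be sketched as follows. Note that the sketch uses a LinkedHashSet only to make the output order predictable for the demo; my code uses a plain HashSet, which does not guarantee order.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class LineDeduper {
    // A set keeps each word of a line only once; LinkedHashSet also
    // preserves first-seen order (the plain HashSet in my code does not).
    static Set<String> uniqueWords(String line) {
        Set<String> wordSet = new LinkedHashSet<>();
        for (String w : line.split("\\s+")) {
            wordSet.add(w); // duplicates are silently ignored
        }
        return wordSet;
    }

    public static void main(String[] args) {
        System.out.println(uniqueWords("to be or not to be")); // [to, be, or, not]
    }
}
```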
STEP 4: Remove empty lines
First, I removed all lines that are empty in the original input:
if (value.toString().isEmpty()){
return;
}
After removing stopwords and special characters, I removed all lines that had become empty:
if (wordSet.isEmpty()){
    return;
}
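Both empty-line guards can be combined into one standalone sketch (this is an illustration, not the Hadoop mapper; `skip` is a hypothetical helper name):

```java
import java.util.HashSet;
import java.util.Set;

public class EmptyLineGuard {
    // Returns true if the line should be dropped: either it was empty in the
    // input, or it became empty after stopword and character removal.
    static boolean skip(String rawLine, Set<String> wordSet) {
        if (rawLine.isEmpty()) {
            return true;  // STEP 4a: originally empty line
        }
        if (wordSet.isEmpty()) {
            return true;  // STEP 4b: line emptied by the cleaning steps
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(skip("", new HashSet<>()));        // true
        System.out.println(skip("the and", new HashSet<>())); // true: all tokens removed
    }
}
```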
STEP 5: Count line numbers
I used two counters:
- LineNumCounter records the number of lines in the initial document;
- FinalLineNumCounter records the number of lines in the output, i.e. after all empty lines have been removed.
The result is shown below:
NUM = 124787
Final_NUM = 114815
So nearly 10,000 lines were removed as empty.
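The counter logic can be sketched outside Hadoop as plain Java (in the real job these are Hadoop `Counter` objects incremented in the mapper; the two-element array here just stands in for the NUM / Final_NUM pair):

```java
import java.util.Arrays;
import java.util.List;

public class LineCounters {
    // Plain-Java sketch of the two counters: element 0 counts every input
    // line (LineNumCounter), element 1 only the lines kept in the output
    // (FinalLineNumCounter).
    static long[] count(List<String> lines) {
        long lineNum = 0, finalLineNum = 0;
        for (String line : lines) {
            lineNum++;                // incremented for every line
            if (!line.isEmpty()) {
                finalLineNum++;       // incremented only for non-empty lines
            }
        }
        return new long[]{lineNum, finalLineNum};
    }

    public static void main(String[] args) {
        long[] c = count(Arrays.asList("to be", "", "or not", ""));
        System.out.println("NUM = " + c[0] + ", Final_NUM = " + c[1]); // NUM = 4, Final_NUM = 2
    }
}
```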
STEP 6: Order the tokens of each line in ascending order of global frequency
As input, I used the word-frequency file for pg100.txt, obtained by running Assignment 1 with a slight modification of MyWordCount, and ordered the tokens of each line by their global frequency:
Collections.sort(wordList, new Comparator<String>() {
    @Override
    public int compare(String s1, String s2) {
        return wordFreq.get(s1).compareTo(wordFreq.get(s2));
    }
});
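The comparator above can be exercised in a self-contained example. The `wordFreq` counts here are made up for illustration and are not the real frequencies from pg100.txt:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FrequencySort {
    // Sort the words of a line by ascending global frequency, looked up
    // in the wordFreq map built from the word-frequency file.
    static List<String> sortByFrequency(List<String> wordList, Map<String, Integer> wordFreq) {
        Collections.sort(wordList, new Comparator<String>() {
            @Override
            public int compare(String s1, String s2) {
                return wordFreq.get(s1).compareTo(wordFreq.get(s2));
            }
        });
        return wordList;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordFreq = new HashMap<>();
        wordFreq.put("king", 4545);   // counts are illustrative,
        wordFreq.put("lord", 3059);   // not taken from pg100.txt
        wordFreq.put("exeunt", 1061);
        List<String> line = new ArrayList<>(Arrays.asList("king", "exeunt", "lord"));
        System.out.println(sortByFrequency(line, wordFreq)); // [exeunt, lord, king]
    }
}
```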
You can see the output file here.
All the details are written in my code Preprocess.java.