Rapleaf hosted a data mining competition on www.kaggle.com that ran for 3 weeks. I am happy to say that the 100-120 hours I spent on it, across a total of 62 submissions, were well worth it: I won the competition. This was the second data mining competition on kaggle.com that I entered and won. The first was a Wine Price prediction competition (http://blog.kaggle.com/2010/12/13/how-we-did-it-jie-and-neeral-on-winning-the-first-kaggle-in-class-competition-at-stanford/) hosted by Stanford University, which I won together with Jie Yang, a Research Engineer from Yahoo Labs.
The competition data consisted of user demographic information along with the URLs and headlines of the pages each user visited. A training set was provided to help train the algorithm to correctly predict user behavior. The goal of the competition was to predict a consumer behavior (i.e. sign up for a newsletter, read another article, etc.) on a personal finance site (http://www.fool.com) for the users that are not in the training set. The competition submission file needed to be a csv file with 100,000 rows and 2 columns (uid, behavior). The competition was constrained to a maximum of 5 submissions per day.
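For illustration, the first few rows of a valid entry might look like this (the uids, probabilities, and header line here are made up):

```
uid,behavior
1000001,0.8731
1000002,0.0212
1000003,0.4550
```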
Data Provided
The data files were as follows:
- demographics: One record per user along with Rapleaf data about each one
- headlines: Each row contains one URL accessed by one user, along with the title of that page, and the number of times that user accessed that URL
- training: For those users in the training set, a binary value indicating whether or not they subscribed
- example_entry: A sample entry showing the users in the test set, and a constant value. For your entries, you should replace the constant value with a probability of subscription, based on your model
Training data size: 201,398 users
Test data size: 100,000 users
Headlines data size: 6M records
I would say that about 90% of my effort went into feature engineering.
Feature Engineering
Here is an example URL from the headlines data.
/investing/beginning/2006/12/19/credit-check-countdown.aspx
I created new features from the first word in the URL string delimited by '/' (here, investing), and also features concatenating the first and second words (here, investing.beginning). The value I assigned to these features was the number of repetitions (i.e. the number of times the user viewed URLs containing that word). The features were then row-normalized to take out the influence of hyper-active users.
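A minimal sketch of this step in R, using made-up column names (uid, url, views) for the headlines data:

```r
headlines <- data.frame(
  uid   = c(1, 1, 2),
  url   = c("/investing/beginning/2006/12/19/credit-check-countdown.aspx",
            "/investing/value/2007/01/02/some-article.aspx",
            "/retirement/2006/11/30/another-article.aspx"),
  views = c(3, 1, 2),
  stringsAsFactors = FALSE
)

parts <- strsplit(sub("^/", "", headlines$url), "/", fixed = TRUE)
headlines$w1  <- sapply(parts, `[`, 1)                                    # e.g. "investing"
headlines$w12 <- sapply(parts, function(p) paste(p[1:2], collapse = "."))  # "investing.beginning"

# Total views per user per first-word token, as a user x token matrix
# (the same aggregation applies to the concatenated w12 tokens)
feat <- xtabs(views ~ uid + w1, data = headlines)

# Row-normalize to damp the influence of hyper-active users
feat <- feat / pmax(rowSums(feat), 1)
```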
Also, based on the training data, I found the distinct headline words (after removing stop-words) for behavior = 0 (words.0) and behavior = 1 (words.1). For each user, I then created 2 features capturing the Jaccard similarity of all the user's headline words with words.0 and words.1 respectively.
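A sketch of these two features, with toy word sets standing in for the real vocabularies:

```r
# Toy stand-ins; the real sets came from the training split
words.0 <- c("credit", "debt", "mortgage")
words.1 <- c("retirement", "dividend", "credit")
user.words <- list(`101` = c("credit", "dividend"),
                   `102` = c("mortgage", "debt", "credit"))

jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

jac0 <- sapply(user.words, jaccard, b = words.0)  # similarity to behavior = 0 vocabulary
jac1 <- sapply(user.words, jaccard, b = words.1)  # similarity to behavior = 1 vocabulary
```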
Based on my experience from a web mining hackathon that I had participated in a few months earlier, I used the Diffbot API to crawl the date on which the blog/article at each URL was written. I also retrieved the author name and the article text for each blog.
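The crawler itself was written in Java; as a rough R illustration, fetching one article might look like the sketch below. The endpoint and field names follow Diffbot's v3 Article API and are assumptions on my part; TOKEN is a placeholder for a real API key.

```r
library(jsonlite)

fetch_article <- function(page.url, token = "TOKEN") {
  api <- sprintf("https://api.diffbot.com/v3/article?token=%s&url=%s",
                 token, URLencode(page.url, reserved = TRUE))
  res <- fromJSON(api)
  obj <- res$objects[1, ]   # fromJSON simplifies the objects array to a data frame
  list(date = obj$date, author = obj$author, text = obj$text)
}
```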
I created one feature per author, with the value being the total number of repetitions per user for that author. The author features were row-normalized. Two other author-related Jaccard similarity features were added based on auth.0 and auth.1 (i.e. auth.0 = authors who wrote articles read by users with behavior = 0, and auth.1 similarly for behavior = 1).
From the Wikipedia page for Motley Fool, I discovered that the paid subscription was started in April 2002. Based on this knowledge, I engineered the following date-related features (the date being when the article/blog was posted by the author); a sketch follows the list:
- Average date difference for each user w.r.t. April 01, 2002
- 1 feature per year, i.e. year.1999, year.2000, etc., having as its value the total repetitions per user, per year, row-normalized
- 2 features giving the proportions of dates that were on or before April 01, 2002, and after it
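A sketch of the three kinds of date features for a single user, with hypothetical dates and repetition counts:

```r
launch <- as.Date("2002-04-01")  # Motley Fool paid-subscription launch
dates  <- as.Date(c("1999-06-15", "2003-02-01", "2006-12-19"))  # article post dates
reps   <- c(2, 1, 3)             # how often this user viewed each article

# 1. Average date difference w.r.t. April 01, 2002 (in days)
avg.diff <- mean(as.numeric(dates - launch))

# 2. Total repetitions per year, row-normalized for the user
yr <- tapply(reps, format(dates, "%Y"), sum)
yr <- yr / sum(yr)               # year.1999, year.2003, year.2006, ...

# 3. Proportions of dates on or before the launch date, and after it
prop.before <- mean(dates <= launch)
prop.after  <- 1 - prop.before
```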
Finally, I performed Latent Dirichlet Allocation (LDA) topic modeling on both the article text (10 topics) and the headline words (10 topics) to add topic-proportion features for both articles and headlines. Stopwords were removed before performing LDA.
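A sketch of the headlines side using the lda package (listed under Software below); the documents are toy stand-ins and stop-words are assumed to be already removed:

```r
library(lda)

docs <- c("credit check countdown", "retirement dividend stocks")
corp <- lexicalize(docs, lower = TRUE)

fit <- lda.collapsed.gibbs.sampler(corp$documents, K = 10, vocab = corp$vocab,
                                   num.iterations = 100, alpha = 0.1, eta = 0.1)

# Per-document topic proportions become 10 features per user/document
props <- t(fit$document_sums) / colSums(fit$document_sums)
```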
The rest were all demographic features. There was a lot of data cleaning and processing involved, since the headlines data contained many blank lines, and the crawled data had noisy dates that needed to be fixed. Also, the author names were not consistent, in the sense that on some blogs an author, say "Jerome Seinfeld", found it cool to put his name as "Jerome 'the comedian' Seinfeld". There were a lot of such cases, which I mapped back to the original name using regular expressions in R.
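A sketch of that normalization, using the example name from above:

```r
authors <- c("Jerome Seinfeld", "Jerome 'the comedian' Seinfeld")

# Strip a quoted nickname sitting between the first and last names
clean <- gsub("\\s*'[^']*'\\s*", " ", authors)
clean <- gsub("\\s+", " ", clean)   # collapse any doubled spaces
# both elements are now "Jerome Seinfeld"
```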
With all the above features, I had about 500 features. I performed a dry run of Gradient Boosted Machines to identify the important features and threw away the rest. The relative influence returned by the summary function of the gbm R package helped me filter the important features. After the filtering, I was left with 120 features.
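A sketch of that filtering step with placeholder data; the 0.5 cutoff is arbitrary and stands in for whatever threshold kept 120 features:

```r
library(gbm)

set.seed(1)
train <- data.frame(matrix(runif(500 * 20), ncol = 20))  # placeholder features
train$y <- rbinom(500, 1, 0.5)                           # placeholder labels

fit <- gbm(y ~ ., data = train, distribution = "bernoulli",
           n.trees = 500, shrinkage = 0.01, interaction.depth = 3)

ri <- summary(fit, plotit = FALSE)               # relative influence per feature
keep <- as.character(ri$var[ri$rel.inf > 0.5])   # keep only the influential ones
```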
Modeling
I used Random Forest, Gradient Boosted Machines (GBM), a Linear Model and a Robust Linear Model to form an ensemble for prediction. For the Random Forest, I had to settle for 200 trees due to time constraints; if time had permitted, I would have used at least the default of 500 trees. For GBM, I chose shrinkage = 0.001, 5,000 trees and 5-fold cross-validation; the number of trees was decided based on the best CV error. For each of the models, I trained separately on the pool of users who had just demographics information and the pool who had both demographics and headlines data available. Through experimentation, I found that this method returned a better AUC. I would also have liked to experiment with the Mahout implementation of Random Forest to see if I could have gotten a faster turnaround.
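A rough sketch of the four-model ensemble on placeholder data; the actual blend and the separate demographics-only/headlines pools are simplified here to a plain average over one pool:

```r
library(randomForest)
library(gbm)
library(MASS)

set.seed(1)
train <- data.frame(matrix(runif(1000 * 10), ncol = 10))
train$y <- rbinom(1000, 1, 0.5)
test <- data.frame(matrix(runif(200 * 10), ncol = 10))

rf  <- randomForest(factor(y) ~ ., data = train, ntree = 200)
gb  <- gbm(y ~ ., data = train, distribution = "bernoulli",
           n.trees = 5000, shrinkage = 0.001, cv.folds = 5)
best <- gbm.perf(gb, method = "cv", plot.it = FALSE)  # trees chosen on CV error
lin <- lm(y ~ ., data = train)
rob <- rlm(y ~ ., data = train)

# Plain average of the four predicted probabilities/scores
pred <- (predict(rf, test, type = "prob")[, 2] +
         predict(gb, test, n.trees = best, type = "response") +
         predict(lin, test) +
         predict(rob, test)) / 4
```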
I ended up with an AUC of 0.80457 on the public leaderboard
and an AUC of 0.80224 on the final test
set.
Team: Seeker
Software
I primarily used R for modeling and Java for crawling the URLs with the help of the Diffbot API.
R packages used:
- lda
- lsa
- gbm
- randomForest
- lm (base R)
- MASS
- kernlab
- e1071
- som