Rapleaf hosted a data mining competition on www.kaggle.com that ran for 3 weeks. I am happy to say that the 100-120 hours I spent on it, across a total of 62 submissions, were well worth it: I won the competition. This was the second data mining competition on kaggle.com that I entered and won. The first was a Wine Price prediction competition (http://blog.kaggle.com/2010/12/13/how-we-did-it-jie-and-neeral-on-winning-the-first-kaggle-in-class-competition-at-stanford/) hosted by Stanford University, which I won together with Jie Yang, a Research Engineer from Yahoo Labs.
The competition data consisted of user demographic information along with the URLs and headlines of the pages each user visited. A training set was provided to help train the algorithm to correctly predict user behavior. The goal of the competition was to predict a consumer behavior (i.e. sign up for a newsletter, read another article, etc.) on a personal finance site (http://www.fool.com) for the users that are not in the training set. The competition submission file needed to be a csv file with 100,000 rows and 2 columns (uid, behavior). The competition was constrained to a maximum of 5 submissions per day.
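For illustration, the first few rows of a valid entry might look like this (the uids, probabilities, and header line here are made up):

```
uid,behavior
1000001,0.8731
1000002,0.0212
1000003,0.4550
```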
Data Provided
The data files were as follows:
- demographics: One record per user along with Rapleaf data about each one
- headlines: Each row contains one URL accessed by one user, along with the title of that page, and the number of times that user accessed that URL
- training: For those users in the training set, a binary value indicating whether or not they subscribed
- example_entry: A sample entry showing the users in the test set, and a constant value. For your entries, you should replace the constant value with a probability of subscription, based on your model
Training data size: 201,398 users
Test data size: 100,000 users
Headlines data size: 6M records
I would say that about 90% of my effort went into feature engineering.
Feature Engineering
Here is an example URL from the headlines data.
/investing/beginning/2006/12/19/credit-check-countdown.aspx
I created new features from the first word in the URL string delimited by '/' (here, investing), and also features concatenating the first and second words (here, investing.beginning). The value I assigned to these features was the number of repetitions (i.e. the number of times the user viewed URLs containing that word). The features were then row-normalized to take out the influence of hyper-active users.
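A minimal sketch of this step in R, using made-up column names (uid, url, views) for the headlines data:

```r
headlines <- data.frame(
  uid   = c(1, 1, 2),
  url   = c("/investing/beginning/2006/12/19/credit-check-countdown.aspx",
            "/investing/value/2007/01/02/some-article.aspx",
            "/retirement/2006/11/30/another-article.aspx"),
  views = c(3, 1, 2),
  stringsAsFactors = FALSE
)

parts <- strsplit(sub("^/", "", headlines$url), "/", fixed = TRUE)
headlines$w1  <- sapply(parts, `[`, 1)                                    # e.g. "investing"
headlines$w12 <- sapply(parts, function(p) paste(p[1:2], collapse = "."))  # "investing.beginning"

# Total views per user per first-word token, as a user x token matrix
# (the same aggregation applies to the concatenated w12 tokens)
feat <- xtabs(views ~ uid + w1, data = headlines)

# Row-normalize to damp the influence of hyper-active users
feat <- feat / pmax(rowSums(feat), 1)
```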
Also, based on the training data, I found the distinct headline words (after removing stop-words) for behavior = 0 (words.0) and behavior = 1 (words.1). For each user, I then created 2 features capturing the Jaccard similarity of all the user's headline words with words.0 and words.1 respectively.
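A sketch of these two features, with toy word sets standing in for the real vocabularies:

```r
# Toy stand-ins; the real sets came from the training split
words.0 <- c("credit", "debt", "mortgage")
words.1 <- c("retirement", "dividend", "credit")
user.words <- list(`101` = c("credit", "dividend"),
                   `102` = c("mortgage", "debt", "credit"))

jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

jac0 <- sapply(user.words, jaccard, b = words.0)  # similarity to behavior = 0 vocabulary
jac1 <- sapply(user.words, jaccard, b = words.1)  # similarity to behavior = 1 vocabulary
```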
Based on my experience from a web mining hackathon that I had participated in a few months earlier, I used the Diffbot API to crawl the date on which the blog/article at each URL was written. I also retrieved the author name and the article text for each blog.
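The crawler itself was written in Java; as a rough R illustration, fetching one article might look like the sketch below. The endpoint and field names follow Diffbot's v3 Article API and are assumptions on my part; TOKEN is a placeholder for a real API key.

```r
library(jsonlite)

fetch_article <- function(page.url, token = "TOKEN") {
  api <- sprintf("https://api.diffbot.com/v3/article?token=%s&url=%s",
                 token, URLencode(page.url, reserved = TRUE))
  res <- fromJSON(api)
  obj <- res$objects[1, ]   # fromJSON simplifies the objects array to a data frame
  list(date = obj$date, author = obj$author, text = obj$text)
}
```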
I created one feature per author, with the value being the total number of repetitions per user for that author. The author features were row-normalized. Two other author-related Jaccard similarity features were added based on auth.0 and auth.1 (i.e. auth.0 = authors who wrote articles read by users with behavior = 0, and auth.1 similarly for behavior = 1).
From the Wikipedia page for Motley Fool, I discovered that the paid subscription was started in April 2002. Based on this knowledge, I engineered the following date-related features (the date being when the article/blog was posted by the author); a sketch follows the list:
- Average date difference for each user w.r.t. April 01, 2002
- 1 feature per year, i.e. year.1999, year.2000, etc., having as its value the total repetitions per user, per year, row-normalized
- 2 features giving the proportions of dates that were on or before April 01, 2002, and after it
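A sketch of the three kinds of date features for a single user, with hypothetical dates and repetition counts:

```r
launch <- as.Date("2002-04-01")  # Motley Fool paid-subscription launch
dates  <- as.Date(c("1999-06-15", "2003-02-01", "2006-12-19"))  # article post dates
reps   <- c(2, 1, 3)             # how often this user viewed each article

# 1. Average date difference w.r.t. April 01, 2002 (in days)
avg.diff <- mean(as.numeric(dates - launch))

# 2. Total repetitions per year, row-normalized for the user
yr <- tapply(reps, format(dates, "%Y"), sum)
yr <- yr / sum(yr)               # year.1999, year.2003, year.2006, ...

# 3. Proportions of dates on or before the launch date, and after it
prop.before <- mean(dates <= launch)
prop.after  <- 1 - prop.before
```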
Finally, I performed Latent Dirichlet Allocation (LDA) topic modeling on both the article text (10 topics) and the headline words (10 topics) to add topic-proportion features for both articles and headlines. Stopwords were removed before performing LDA.
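A sketch of the headlines side using the lda package (listed under Software below); the documents are toy stand-ins and stop-words are assumed to be already removed:

```r
library(lda)

docs <- c("credit check countdown", "retirement dividend stocks")
corp <- lexicalize(docs, lower = TRUE)

fit <- lda.collapsed.gibbs.sampler(corp$documents, K = 10, vocab = corp$vocab,
                                   num.iterations = 100, alpha = 0.1, eta = 0.1)

# Per-document topic proportions become 10 features per user/document
props <- t(fit$document_sums) / colSums(fit$document_sums)
```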
The rest were all demographic features. There was a lot of data cleaning and processing involved, since the headlines data contained many blank lines, and the crawled data had noisy dates that needed to be fixed. Also, the author names were not consistent, in the sense that on some blogs an author, say "Jerome Seinfeld", found it cool to put his name as "Jerome 'the comedian' Seinfeld". There were a lot of such cases, which I mapped back to the original name using regular expressions in R.
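A sketch of that normalization, using the example name from above:

```r
authors <- c("Jerome Seinfeld", "Jerome 'the comedian' Seinfeld")

# Strip a quoted nickname sitting between the first and last names
clean <- gsub("\\s*'[^']*'\\s*", " ", authors)
clean <- gsub("\\s+", " ", clean)   # collapse any doubled spaces
# both elements are now "Jerome Seinfeld"
```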
With all the above features, I had about 500 features. I performed a dry run of Gradient Boosted Machines to identify the important features and threw away the rest. The relative influence returned by the summary function of the gbm R package helped me filter the important features. After the filtering, I was left with 120 features.
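A sketch of that filtering step with placeholder data; the 0.5 cutoff is arbitrary and stands in for whatever threshold kept 120 features:

```r
library(gbm)

set.seed(1)
train <- data.frame(matrix(runif(500 * 20), ncol = 20))  # placeholder features
train$y <- rbinom(500, 1, 0.5)                           # placeholder labels

fit <- gbm(y ~ ., data = train, distribution = "bernoulli",
           n.trees = 500, shrinkage = 0.01, interaction.depth = 3)

ri <- summary(fit, plotit = FALSE)               # relative influence per feature
keep <- as.character(ri$var[ri$rel.inf > 0.5])   # keep only the influential ones
```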
Modeling
I used Random Forest, Gradient Boosted Machines (GBM), a Linear Model and a Robust Linear Model to form an ensemble for prediction. For the Random Forest, I had to settle for 200 trees due to time constraints; if time had permitted, I would have used at least the default of 500 trees. For GBM, I chose shrinkage = 0.001, 5,000 trees and 5-fold cross-validation; the number of trees was decided based on the best CV error. For each of the models, I trained separately on the pool of users who had just demographics information and the pool who had both demographics and headlines data available. Through experimentation, I found that this method returned a better AUC. I would also have liked to experiment with the Mahout implementation of Random Forest to see if I could have gotten a faster turnaround.
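A rough sketch of the four-model ensemble on placeholder data; the actual blend and the separate demographics-only/headlines pools are simplified here to a plain average over one pool:

```r
library(randomForest)
library(gbm)
library(MASS)

set.seed(1)
train <- data.frame(matrix(runif(1000 * 10), ncol = 10))
train$y <- rbinom(1000, 1, 0.5)
test <- data.frame(matrix(runif(200 * 10), ncol = 10))

rf  <- randomForest(factor(y) ~ ., data = train, ntree = 200)
gb  <- gbm(y ~ ., data = train, distribution = "bernoulli",
           n.trees = 5000, shrinkage = 0.001, cv.folds = 5)
best <- gbm.perf(gb, method = "cv", plot.it = FALSE)  # trees chosen on CV error
lin <- lm(y ~ ., data = train)
rob <- rlm(y ~ ., data = train)

# Plain average of the four predicted probabilities/scores
pred <- (predict(rf, test, type = "prob")[, 2] +
         predict(gb, test, n.trees = best, type = "response") +
         predict(lin, test) +
         predict(rob, test)) / 4
```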
I ended up with an AUC of 0.80457 on the public leaderboard
and an AUC of 0.80224 on the final test
set.
Team: Seeker
Software
I primarily used R for modeling and Java for crawling the URLs with the help of the Diffbot API.
R packages used:
- lda
- lsa
- gbm
- randomForest
- lm (base R)
- MASS
- kernlab
- e1071
- som