RapLeaf hosted a data mining competition on www.kaggle.com that ran for 3 weeks. I am happy to say that the 100-120 hours I spent on it, across a total of 62 submissions, were well worth it, since I won the competition. This was the second data mining competition on kaggle.com that I participated in and won. The first was a wine price prediction competition (http://blog.kaggle.com/2010/12/13/how-we-did-it-jie-and-neeral-on-winning-the-first-kaggle-in-class-competition-at-stanford/) hosted by Stanford University, which I won together with Jie Yang, a Research Engineer from Yahoo Labs.
The competition data consisted of user demographic information along with the URLs and headlines that each user visited. A training set was provided to train an algorithm to correctly predict user behavior. The goal of the competition was to predict a consumer behavior (e.g. signing up for a newsletter, reading another article, etc.) on a personal finance site (http://www.fool.com) for the users not in the training set. The submission file needed to be a csv file with 100,000 rows and 2 columns (uid, behavior). The competition was constrained to a maximum of 5 submissions per day.
The data files were as follows:
- demographics: One record per user along with Rapleaf data about each one
- headlines: Each row contains one URL accessed by one user, along with the title of that page, and the number of times that user accessed that URL
- training: For those users in the training set, a binary value indicating whether or not they subscribed
- example_entry: A sample entry showing the users in the test set, and a constant value. For your entries, you should replace the constant value with a probability of subscription, based on your model
Training Data size : 201398 users
Test Data size : 100000 users
Headlines data size : 6M
I would say about 90% of the effort that I spent was on feature engineering.
Here is an example URL from the headlines data.
I created new features from the first word in the URL path, delimited by '/' (here, investing), and also features concatenating the first and second words (here, investing.beginning). The value I assigned to these features was the number of repetitions (i.e. the number of times the user viewed URLs with that word). The features were then row-normalized to take out the influence of hyper-active users.
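A minimal sketch of how these URL-word features could be built in R, assuming a data frame `headlines` with columns `uid`, `url` and `reps` (the repetition count); the column and object names are illustrative, not the originals:

```r
library(reshape2)

# Strip the scheme and host, then split the URL path on '/'
path_parts <- strsplit(sub("^https?://[^/]+/", "", headlines$url), "/")
first_word  <- sapply(path_parts, function(p) p[1])
second_word <- sapply(path_parts, function(p) ifelse(length(p) > 1, p[2], NA))

headlines$w1  <- first_word                                  # e.g. "investing"
headlines$w12 <- paste(first_word, second_word, sep = ".")   # e.g. "investing.beginning"

# One column per first word, value = total repetitions per user
feat <- dcast(headlines, uid ~ w1, value.var = "reps", fun.aggregate = sum)

# Row-normalize so hyper-active users do not dominate
m <- as.matrix(feat[, -1])
feat[, -1] <- m / pmax(rowSums(m), 1)
```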
Also, based on the training data, I found the distinct headline words (after removing stop-words) for behavior = 0 (words.0) and behavior = 1 (words.1). For each user, I then created 2 features that captured the Jaccard similarity between all the headline words for that user and words.0 and words.1 respectively.
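A small sketch of the two Jaccard similarity features, assuming `user_words` is a list mapping each user to his set of (stop-word-free) headline words, and `words.0` / `words.1` are the distinct-word sets from the training data; names are illustrative:

```r
# Jaccard similarity: |intersection| / |union| of two word sets
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

jac0 <- sapply(user_words, jaccard, b = words.0)  # similarity to behavior = 0 words
jac1 <- sapply(user_words, jaccard, b = words.1)  # similarity to behavior = 1 words
```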
Based on my experience from a Web Mining Hackathon that I had participated in a few months earlier, I used the Diffbot API to crawl the date on which the blog/article at each URL was written. I also retrieved the author name and the article text for each blog.
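The actual crawling was done in Java; the R sketch below only illustrates the idea. The Diffbot Article API endpoint, parameter names and response fields shown here are assumptions, not taken from the original code:

```r
library(RCurl)
library(rjson)

fetch_article <- function(page_url, token) {
  # Assumed endpoint and parameters for the Diffbot Article API
  api <- sprintf("http://api.diffbot.com/api/article?token=%s&url=%s",
                 token, URLencode(page_url, reserved = TRUE))
  res <- fromJSON(getURL(api))
  # Assumed response fields: posting date, author name, article text
  list(date = res$date, author = res$author, text = res$text)
}
```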
I created one feature per author, with the value being the total number of repetitions per user for that author. The author features were row-normalized. Two more author-related Jaccard similarity features were added based on auth.0 and auth.1 (i.e. auth.0 = authors who wrote articles read by users with behavior = 0, and auth.1 similarly for behavior = 1).
From the Wikipedia page for Motley Fool, I discovered that the paid subscription was started in April 2002. Based on this knowledge, I engineered the following date-related features (the date being when the article/blog was posted by the author); a sketch follows the list:
- Average date difference for each user w.r.t. April 01, 2002
- 1 feature per year, i.e. year.1999, year.2000, etc., with the value being the total repetitions per user per year, row-normalized
- 2 features giving the proportions of dates that were on or before April 01, 2002 and after it
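A sketch of the date features above, assuming a data frame `articles` with columns `uid`, `date` (the posting date, as a Date) and `reps`; column names are illustrative:

```r
library(reshape2)

cutoff <- as.Date("2002-04-01")

# Average difference (in days) from April 01, 2002, per user
avg_diff <- tapply(as.numeric(articles$date - cutoff), articles$uid, mean)

# One feature per year of publication, row-normalized repetitions
articles$year <- format(articles$date, "%Y")
year_feat <- dcast(articles, uid ~ year, value.var = "reps", fun.aggregate = sum)
m <- as.matrix(year_feat[, -1])
year_feat[, -1] <- m / pmax(rowSums(m), 1)

# Proportions of article dates on/before vs. after the cutoff
prop_before <- tapply(articles$date <= cutoff, articles$uid, mean)
prop_after  <- 1 - prop_before
```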
Finally, I performed Latent Dirichlet Allocation (LDA) topic modeling on both the article text (10 topics) and the headline words (10 topics) to add topic-proportion features for both articles and headlines. Stop-words were removed before performing LDA.
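A sketch of the LDA step on the headline words (the article text was handled the same way), assuming `headline_text` holds one concatenated document per user; object names are illustrative:

```r
library(tm)
library(topicmodels)

corpus <- Corpus(VectorSource(headline_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop-words
dtm <- DocumentTermMatrix(corpus)
dtm <- dtm[slam::row_sums(dtm) > 0, ]  # LDA needs non-empty documents

lda_fit <- LDA(dtm, k = 10)              # 10 topics, as in the write-up
topic_feat <- posterior(lda_fit)$topics  # per-user topic proportions
```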
The rest were all demographics features. There was a lot of data cleaning and processing involved, since there were many blank lines in the headlines data, and the crawled data had noisy dates that needed to be fixed. Also, the author names were not consistent, in the sense that for some blogs an author, say "Jerome Seinfeld", found it cool to put his name down as "Jerome 'the comedian' Seinfeld". There were a lot of such cases, which needed to be mapped back to the original name using regular expressions in R.
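A minimal sketch of the kind of regular-expression cleanup used for the author names; the nickname pattern here is a hypothetical example rather than the actual mapping rules:

```r
authors <- c("Jerome Seinfeld", "Jerome 'the comedian' Seinfeld")

# Drop quoted nicknames, then collapse the extra whitespace they leave behind
clean <- gsub("'[^']*'", "", authors)
clean <- gsub("\\s+", " ", clean)
clean <- trimws(clean)
# clean is now "Jerome Seinfeld" for both variants
```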
With all the above features, I had about 500 features. I performed a dry run of Gradient Boosted Machines to identify the important features and threw away the rest. The relative influence returned by the summary function of the gbm R package helped me filter the important features. After the filtering, I was left with 120 features.
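A sketch of the relative-influence screening, assuming `train` holds the ~500 features plus the binary target `behavior`; the dry-run hyperparameters and the threshold are illustrative:

```r
library(gbm)

dry_run <- gbm(behavior ~ ., data = train, distribution = "bernoulli",
               n.trees = 1000, shrinkage = 0.01)

# summary.gbm returns a data frame with columns var and rel.inf
infl <- summary(dry_run, plotit = FALSE)
keep <- as.character(infl$var[infl$rel.inf > 0])   # keep only influential features
train_small <- train[, c(keep, "behavior")]
```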
I used RandomForest, Gradient Boosted Machines (GBM), a Linear Model and a Robust Linear Model to form an ensemble for prediction. For the RandomForest, I had to settle for 200 trees due to time constraints; had time permitted, I would have used at least the default of 500 trees. For GBM, I chose shrinkage = 0.001, 5000 trees and 5-fold cross-validation; the number of trees actually used was decided based on the best cross-validation error. For each of the models, I trained separately on the pool of users who had only demographics information and on the pool of users who had both demographics and headlines data available; through experimentation, I found that this approach returned a better AUC. I would also have liked to experiment with the Mahout implementation of RandomForest to see if I could have got a faster turn-around.
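A simplified sketch of the ensemble (a single pool of users is shown, whereas the actual models were trained separately on the two pools described above); the plain average of the four predictions is illustrative:

```r
library(randomForest)
library(gbm)
library(MASS)

rf  <- randomForest(as.factor(behavior) ~ ., data = train, ntree = 200)
gb  <- gbm(behavior ~ ., data = train, distribution = "bernoulli",
           n.trees = 5000, shrinkage = 0.001, cv.folds = 5)
best_iter <- gbm.perf(gb, method = "cv")   # number of trees chosen on CV error
lin <- lm(behavior ~ ., data = train)
rob <- rlm(behavior ~ ., data = train)

# Blend the four models' predictions on the test set
pred <- (predict(rf, test, type = "prob")[, 2] +
         predict(gb, test, n.trees = best_iter, type = "response") +
         predict(lin, test) +
         predict(rob, test)) / 4
```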
I ended up with an AUC of 0.80457 on the public leaderboard and an AUC of 0.80224 on the final test set.
Team : Seeker
I primarily used R for modeling and Java for crawling the URLs with the help of the Diffbot API.
R packages used :