Sunday, June 26, 2011

diffbot Web Mining Hackathon !

So, I just got back at 3am this morning attending the diffbot Web Mining hackathon organized at the AOL office in Palo Alto, CA. Hackers from startups and companies like Google, AOL, Y!, Apple, LinkedIn, Students from Stanford, Designers were all there networking, sharing their ideas and booting up their notebooks getting all ready to start hacking.

The event started off with a presentation by the diffbot CEO Mike Tung welcoming all for making to the event and going through the agenda. First on the list were 2 presentations by AOL engineers talking about how they are revamping their productline and ofcourse they mentioned Techcrunch in the passing. Followed by that there were presentations by engineers from mostly early startup companies to demo some of their cool and useful APIs. To name some :

diffbot
factual
context.io
effectcheck

diffbot provides you with the article API to extract a clean article along with tags for a give URL, with a simple HTTP GET request. They also provide a frontpage API that helps you identify different sections for a given webpage. The article API to me is particularly helpful for machine learning folks who intend to develop some web mining toolkit without having to investigate native APIs to the source website.

factual provided with some datasets for people to try out with. I don't think anyone used their datasets :) The Factual server API enables direct access to Factual tables from your application, for the purpose of discovering datasets, reading from datasets, and correcting/adding rows to datasets.  It uses a RESTful interface and returns data in JSON.

context.io provides an API to leverage the data contained in users mailboxes
It's however, not meant to send or receive emails, it's a tool to extract rich information from the mailbox your users use as their own email address. The cool feature I thought in their demo was how it serves as a dropbox for the attachments in your email thread.

effectcheck provides you with a sentiment score [0-4] for anxiety, hostility, depression, confidence, compassion and happiness for a given article text. The effectcheck API internally uses diffbot API for identifying article texts.

Once the engineers were done with their presentations, everyone was asked to introduce themself with atleast 3 other people. It was then that the crowd started discussing their ideas for the project and forming teams.

My idea was to use the diffbot API and build a recommendation engine for blogs based on tags and sentiments derived from diffbot and effectcheck APIs respectively. I teamed up with 2 Software Engineers from AOL and an intern from VMware. I built 3 flavors of clustering implementations based on using tags and/or sentiments with tuning distance metrics. I chose JaccardSimilarity and Euclidean distance to measure distance between tags and sentiments features between two articles. I also built an inverted index on tags to aggregate the sentiment vectors from different web articles to analyze what tags/keywords pertained to the strongest sentiments. Did I tell you that the competition was scheduled to end at 3am in the morning and I received the real data(that one of the team member was working on) at 2am. One of the team member left unannounced after about an hour after the hackathon started. Anyways, I guess thats a experience as well :) I used R to create a tagCloud and a heatmap to visualize the correlation between tags and the sentiments sentiments.

There were some interesting demos from teams making good use of the APIs. However, I think there were a lot of them that missed the whole point of the theme of hackathon, which was "web mining" i.e. to build a project based on Statistical Inference/ML/DM techniques on top of data retrieved from the web using the APIs and not just how to use APIs and display stuff on a iPhone !

Overall, it was a splendid experience interacting with hackers and working and completing a project from scratch in 12 straight hours. I definitely see myself exploring some of these APIs and putting them to use.

Keep checking the github space(one in the title), I will be committing the code soon...

No comments:

Post a Comment