Wednesday, September 7, 2011

Active Learning

Active Learning is a method for identifying which data points to label in a supervised setting where the cost of labeling is very high. The basic idea is that you start off with a small set of labeled instances, build a model or a committee of models, and use the model(s) to identify the unlabeled instances for which the degree of confusion/uncertainty is highest.

Consider a simple 2-class SVM, where the goal is to identify a hyperplane that linearly separates the instances and maximizes the margin. It is quite intuitive that the uncertainty of classifying instances is highest near the margin, i.e. around the support vectors, especially in a relaxed setting where slack variables allow a few training misclassifications.
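As a sketch of that intuition: given a hyperplane already learned by an SVM (assumed here as a weight vector `w` and bias `b`, both hypothetical values for illustration), we can query the unlabeled points that lie closest to the decision boundary.

```python
import math

def margin_distances(w, b, points):
    """Distance of each point to the hyperplane w.x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return [abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
            for x in points]

def query_closest(w, b, unlabeled, k=1):
    """Return indices of the k unlabeled points closest to the boundary."""
    d = margin_distances(w, b, unlabeled)
    order = sorted(range(len(unlabeled)), key=lambda i: d[i])
    return order[:k]

# Hyperplane x1 + x2 - 1 = 0; the point nearest the boundary is queried.
unlabeled = [(0.0, 0.0), (0.45, 0.5), (2.0, 2.0)]
print(query_closest((1.0, 1.0), -1.0, unlabeled))  # -> [1]
```

In practice you would get `w` and `b` from a trained linear SVM rather than hard-coding them; the ranking logic stays the same.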

Query by uncertainty
In uncertainty sampling, preference is given to instances that receive the lowest class probability estimate from the classifier (or ensemble of classifiers). The class probability here is the highest probability with which an instance is assigned a class label. The degree of uncertainty can be measured simply as 1 - P(Y_hat|x), where Y_hat is the class label for which the conditional probability P(Y|x) is highest. There are other measures as well, such as:

  • Entropy
  • Difference in conditional probabilities: P(Y1|x) - P(Y2|x), where Y1 is the class with the highest probability for x and Y2 the next highest; a smaller difference means greater uncertainty.
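The three measures above can be sketched directly from a vector of class probabilities (the probability values below are made up for illustration):

```python
import math

def least_confidence(probs):
    """1 - P(Y_hat|x): higher means more uncertain."""
    return 1.0 - max(probs)

def entropy(probs):
    """Shannon entropy of the class distribution: higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def margin(probs):
    """P(Y1|x) - P(Y2|x): smaller means more uncertain."""
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - top2[1]

# A near-uniform distribution scores as more uncertain on all three measures.
uncertain = [0.4, 0.35, 0.25]
confident = [0.9, 0.07, 0.03]
print(least_confidence(uncertain) > least_confidence(confident))  # True
print(entropy(uncertain) > entropy(confident))                    # True
print(margin(uncertain) < margin(confident))                      # True
```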

Query by committee
The idea here is to form a committee of weak learners, which are either entirely different classification methods (decision tree, SVM, k-NN, etc.) or many instances of the same method, say 100 decision trees generated via bagging. Each of these classifiers is trained on the initial small set of labeled instances. The goal is to identify those unlabeled instances on which the base learners' votes disagree the most. Engelson and Dagan (1996) formulated a vote entropy to identify candidates for labeling.
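A minimal sketch of vote entropy: given each committee member's predicted label for one instance, compute the entropy of the vote distribution. A split vote scores higher than near-consensus, so the instance with the most disagreement is queried.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's vote distribution for one instance.

    votes: one predicted label per committee member.
    Higher entropy = more disagreement = better candidate for labeling.
    """
    n = len(votes)
    counts = Counter(votes)
    return -sum((v / n) * math.log(v / n) for v in counts.values())

# Ten committee members: a 5/5 split is maximally uncertain,
# a 9/1 near-consensus much less so.
split = ['A'] * 5 + ['B'] * 5
consensus = ['A'] * 9 + ['B'] * 1
print(vote_entropy(split) > vote_entropy(consensus))  # True
```

Note this is the plain vote-entropy idea; Engelson and Dagan's formulation also normalizes by log of the committee size, which only rescales the ranking.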

One extension to the uncertainty method that I am currently working on is how to avoid sampling an unlabeled instance as a candidate for labeling when its contribution to the model was potentially already explained by an earlier sampled unlabeled point within epsilon distance. One approach here could be to apply Canopy Clustering to the identified candidates for labeling, i.e. add another point only if its distance to all the points so far in the canopy list is at least T1.
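The canopy-style filter described above might look like the following sketch (Euclidean distance and the sample points are my own assumptions; T1 is the canopy threshold from the text):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filter_candidates(candidates, t1):
    """Keep a candidate only if it is at least t1 away from every point
    already kept, so near-duplicate queries are skipped."""
    kept = []
    for c in candidates:
        if all(euclidean(c, k) >= t1 for k in kept):
            kept.append(c)
    return kept

# The second point lies within t1 of the first and is skipped as redundant.
cands = [(0.0, 0.0), (0.1, 0.1), (3.0, 3.0)]
print(filter_candidates(cands, t1=1.0))  # -> [(0.0, 0.0), (3.0, 3.0)]
```

Processing candidates in order of decreasing uncertainty would make the filter keep the most informative representative of each neighborhood.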