Consider a simple 2-class SVM, where the goal is to identify a hyperplane that linearly separates the instances and maximizes the boundary. It is quite intuitive to believe that the uncertainty of classifying instances is the highest around the margin or the support vectors, especially in a relaxed setting where you allow for a few training misclassification with introduction of slack variables.

**Query by uncertainty**

In uncertainty sampling, instances are preferred that receives the lowest class probability estimate by the ensemble of classifiers. The class probability is the highest probability with which an instance is assigned a class label. The degree of uncertainty could be measured as simply 1 - P(Y_hat|x), where Y_hat is the class label for with the conditional probability P(Y|x) is the highest. There are other ways to measure as well, like:

- Entropy
- Difference in conditional probability : P(Y1|x) - P(Y2|x), where Y1 is the class for which x has the highest probability and Y2 is the next highest.

**Query by committee**

The idea here is to form a committee of weak learners that are either totally different classification methods like decision tree, SVM, k-NN, etc or a bunch of same type of method, say like 100 decision trees that you generate via bagging. Each of these classifiers would work on the initial small set of labeled instances. The goal is here to identify those unlabeled instances for which base learners differ in their votes the most. Engelson and Dangon(1996) formulated a voted entropy to identify candidates for labeling.

One of the extension to the uncertainty method that I am currently working on is how to avoid sampling an unlabeled instance as a candidate for labeling, the contribution to the model of which was potentially already explained by an earlier sampled unlabeled point within epsilon distance. One of the approach here could be to apply Canopy Clustering on identified candidates for labeling, i.e. add another point only if the distance to all the points so far in the canopy list is at-least T1.

ReplyDeleteThank you for sharing such great information !

big data classroom training

hadoop ónlinÉ training

free big data bootcamp

hadoop big data videos

Big data QA Tester training

Big data Analyst training