Authored by: Blake Miller, Walter Mebane, and Joseph Klaver
Abstract: To achieve requisite performance, social scientists must invest substantial time and money manually labeling documents for supervised text classification. For most text classification problems, a subset of documents is manually labeled, and these documents are used to fit a classifier that predicts the class of out-of-sample documents. The choice of sampling method, however, is often overlooked, and researchers often default to a random sample. Unfortunately, this sometimes leads to sparse coverage of rare classes, requires a great deal of time and effort to reach acceptable performance, and is an inefficient way of providing adequate training sample variance. In this paper we outline a method for iteratively sampling articles using cluster sampling of a version space (a space of uncertainty) defined by the distance to the separating hyperplanes of a set of candidate support vector classifiers. Within this framework, an oracle labels a set of documents; at regular intervals, support vector classifiers are refit, and unlabeled documents within the version space are sampled by cluster. As Tong and Koller (2001) have shown, this iterative process allows researchers to achieve high levels of classification accuracy with less effort spent labeling documents. In an experiment, we compare the performance of our method to existing algorithms.
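One round of the sampling loop described in the abstract might be sketched as follows. This is a minimal illustration, not the authors' implementation: the uncertainty band (here, the documents closest to a single fitted hyperplane), the band size, and the number of clusters are all assumptions made for the example, using scikit-learn on toy data.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy stand-ins for document feature vectors: a small labeled seed set
# and a larger unlabeled pool (200 and 1000 documents, 20 features).
X_labeled = rng.normal(size=(200, 20))
y_labeled = (X_labeled[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_unlabeled = rng.normal(size=(1000, 20))

# 1. Fit a support vector classifier on the currently labeled documents.
clf = LinearSVC().fit(X_labeled, y_labeled)

# 2. Distance of each unlabeled document to the separating hyperplane.
margins = np.abs(clf.decision_function(X_unlabeled))

# 3. Keep the most uncertain documents, i.e. those nearest the hyperplane
#    (a crude proxy for the version space; 100 is an assumed band size).
band_idx = np.argsort(margins)[:100]
band = X_unlabeled[band_idx]

# 4. Cluster the uncertain documents and pick one per cluster to send to
#    the oracle for labeling (10 clusters is an assumed setting).
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(band)
to_label = [int(band_idx[clusters == c][0]) for c in range(10)]
```

After the oracle labels the selected documents, they would be moved into the labeled set and the classifier refit, repeating the loop; sampling by cluster rather than taking only the single most uncertain points spreads the new labels across distinct regions of the uncertain zone.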