Authored by: Blake Miller, Fridolin Linder, and Walter Mebane
Abstract: When the concepts to measure in corpora are known in advance, supervised methods are likely to provide better qualitative results, model selection procedures, and model performance measures. In this paper, we illustrate that much of the expense of manual corpus labeling comes from common sampling practices, such as random sampling, that result in sparse coverage across classes and in duplicated effort by the expert who is labeling texts (it does not help your model's performance to label a document that is very similar to one the expert has already labeled). We outline several active learning methods for iteratively modeling text and sampling articles based on model uncertainty with respect to unlabeled posts. We show that with particular care in sampling unlabeled data, researchers can train high-performance text classification models using a fraction of the labeled documents one would need with random sampling. We illustrate this with several experiments on three corpora that vary in size and domain (Tweets, Wikipedia talk sections, and Breitbart articles).
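The uncertainty-based sampling loop the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the toy corpus, the logistic regression classifier, and the single-query-per-iteration batch size are all assumptions made here for brevity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy "corpus" with known labels standing in for the expert (oracle).
texts = [
    "great product loved it", "terrible waste of money",
    "excellent quality highly recommend", "awful broke immediately",
    "fantastic service very happy", "horrible experience never again",
    "wonderful purchase works well", "bad defective returned it",
]
oracle_labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

X = TfidfVectorizer().fit_transform(texts)

# Seed with one labeled example per class; the rest form the unlabeled pool.
labeled = [0, 1]
pool = [i for i in range(len(texts)) if i not in labeled]

clf = LogisticRegression()
for _ in range(3):  # three active-learning iterations
    clf.fit(X[labeled], oracle_labels[labeled])
    # Uncertainty sampling: pick the pool document whose predicted
    # class probability is closest to 0.5 (the model is least sure).
    probs = clf.predict_proba(X[pool])[:, 1]
    most_uncertain = pool[int(np.argmin(np.abs(probs - 0.5)))]
    # "Query the expert" for its label and move it to the labeled set.
    labeled.append(most_uncertain)
    pool.remove(most_uncertain)

print(sorted(labeled))  # indices the model chose to have labeled
```

The key design choice is the query criterion: instead of drawing the next document to label at random, each iteration asks the expert about the document the current model finds most ambiguous, which avoids spending labeling effort on documents the model already classifies confidently.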
Read Full Paper Here