Reducing Manual Labeling Effort in Imbalanced Data Sets: Active Learning for Detecting Illicit Massage Business Reviews
Abstract
Human trafficking investigators face challenges when processing the sheer volume of publicly available online data. Natural language processing (NLP) models can assist in identifying evidence of exploitation in text data, such as business reviews. However, the scarcity of large, accurately labeled training data sets limits the potential of NLP-based detection algorithms. Labeling data sets related to human trafficking is challenging for three reasons: identifying indicators of trafficking requires domain expertise, trafficking cases make up a small portion of the data, and reviewing disturbing content is emotionally demanding for annotators. Active learning streamlines model training by strategically querying labels for the most informative data points, with the goal of achieving high accuracy from few annotations. We formulate active learning as a decision model and learn a policy through deep reinforcement learning. We evaluate this approach on the imbalanced classification task of detecting Yelp reviews of massage businesses that contain human trafficking risk factors. The active learning policy surpasses benchmark methods in the scoring metric used for classifier training. Moreover, its strong performance remains consistent even in large batch query settings. The proposed approach is compatible with any scoring metric and is particularly well-suited for imbalanced NLP tasks in which labeling demands substantial time, domain expertise, and emotional effort.
Funding: This research was partly supported by the Criminal Investigations and Network Analysis (CINA) [Grant Award 17STCIN00001].
Supplemental Material: All supplemental materials, including the code, data, and files required to reproduce the results, are available at https://doi.org/10.1287/opre.2023.0625.
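Illustrative sketch: To make the abstract's notions of "querying the most informative data points" and "scoring metric" concrete, the following minimal Python sketch shows one batch query step using uncertainty sampling, together with the F1 score as an example of a metric suited to imbalanced data. Uncertainty sampling is a common heuristic and is used here only for illustration; it is not the learned reinforcement learning policy described above. The function names `select_batch` and `f1_score`, and all numbers, are hypothetical.

```python
from typing import List, Sequence, Set


def select_batch(probs: Sequence[float], labeled: Set[int], batch_size: int) -> List[int]:
    """Pick the unlabeled items whose predicted probability is closest to 0.5.

    This is plain uncertainty sampling (an assumed stand-in heuristic),
    not a learned query policy.
    """
    candidates = [i for i in range(len(probs)) if i not in labeled]
    # Smaller distance from 0.5 means the classifier is less certain.
    candidates.sort(key=lambda i: abs(probs[i] - 0.5))
    return candidates[:batch_size]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for the positive (rare) class; returns 0.0 if there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predicted probabilities for five reviews; item 1 is already labeled.
probs = [0.02, 0.48, 0.91, 0.55, 0.10]
query = select_batch(probs, labeled={1}, batch_size=2)

# Predicting "negative" for every review is right on 3 of 5 items here,
# yet its F1 score is 0.0, which is why accuracy can mislead on rare classes.
y_true = [1, 0, 0, 0, 1]
all_negative = [0, 0, 0, 0, 0]
baseline_f1 = f1_score(y_true, all_negative)
```

In this sketch, `query` holds the indices an annotator would be asked to label next. A learned policy would replace the fixed distance-from-0.5 rule with a selection rule trained to improve the chosen scoring metric.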

