Semantic Aggregated Adversarial Training Framework for Hate Speech Detection

Published Online: https://doi.org/10.1287/ijoc.2023.0508

Hate speech poses a growing challenge to digital platforms with large and diverse user bases, prompting widespread adoption of deep learning (DL) models for automated detection at scale. Existing research, however, predominantly focuses on improving detection accuracy while paying limited attention to the vulnerability of DL-based detection models to adversarial attacks from malicious spreaders. To bridge this gap, we propose an adversarial training framework to improve the adversarial robustness of hate speech detection. This framework integrates imbalanced adversarial training with a novel semantic aggregation technology to learn robust yet discriminative features from hate speech corpora. We further introduce an adversarial attack generation framework to assess the performance of existing DL-based hate speech detection models under such attacks. Extensive computational experiments conducted on eight publicly available hate speech corpora demonstrate the robustness of the proposed method against attacks. In contrast, we show that existing DL-based detection models can be easily circumvented by adversarial attacks, allowing the dissemination of hateful sentiments through subtle modifications to the content. Additionally, we conduct comparative analyses of the proposed method with various adversarial training and imbalance training methods to illustrate its effectiveness in simultaneously addressing the data imbalance and feature inseparability issues inherent in hate speech detection. This study presents significant managerial implications, aiding online platforms in implementing effective measures to prevent the spread of hateful speech.

History: This paper has been accepted by Kaushik Dutta for the Special Issue on Responsible AI and Data Science for Social Good.

Funding: This work was supported by the National Natural Science Foundation of China [Grants 72225011, 72434005 and 72293575].

Supplemental Material: The software that supports the findings of this study is available within the paper and its Supplemental Information (https://pubsonline.informs.org/doi/suppl/10.1287/ijoc.2023.0508) as well as from the IJOC GitHub software repository (https://github.com/INFORMSJoC/2023.0508). The complete IJOC Software and Data Repository is available at https://informsjoc.github.io/.
