Hate speech poses a growing challenge to digital platforms with large and diverse user bases, prompting widespread adoption of deep learning (DL) models for automated detection at scale. Existing research, however, predominantly focuses on improving detection accuracy while paying limited attention to the vulnerability of DL-based detection models to adversarial attacks from malicious spreaders. To bridge this gap, we propose an adversarial training framework to improve the adversarial robustness of hate speech detection. This framework integrates imbalanced adversarial training with a novel semantic aggregation technology to learn robust yet discriminative features from hate speech corpora. We further introduce an adversarial attack generation framework to assess the performance of existing DL-based hate speech detection models under such attacks. Extensive computational experiments conducted on eight publicly available hate speech corpora demonstrate the robustness of the proposed method against attacks. In contrast, we show that existing DL-based detection models can be easily circumvented by adversarial attacks, allowing the dissemination of hateful sentiments through subtle modifications to the content. Additionally, we conduct comparative analyses of the proposed method with various adversarial training and imbalance training methods to illustrate its effectiveness in simultaneously addressing the data imbalance and feature inseparability issues inherent in hate speech detection. This study presents significant managerial implications, aiding online platforms in implementing effective measures to prevent the spread of hateful speech.

History: This paper has been accepted by Kaushik Dutta for the Special Issue on Responsible AI and Data Science for Social Good.

Funding: This work was supported by the National Natural Science Foundation of China [Grants 72225011, 72434005 and 72293575].

Supplemental Material: The software that supports the findings of this study is available within the paper and its Supplemental Information (https://pubsonline.informs.org/doi/suppl/10.1287/ijoc.2023.0508) as well as from the IJOC GitHub software repository (https://github.com/INFORMSJoC/2023.0508). The complete IJOC Software and Data Repository is available at https://informsjoc.github.io/.

INFORMS Journal on Computing

Articles In Advance

Information

Received: December 30, 2023
Accepted: April 02, 2026
Published Online: May 13, 2026

Cite as

Xingwei Zhang, Hu Tian, Xiaolong Zheng, Jing Peng, Daniel Dajun Zeng (2026) Semantic Aggregated Adversarial Training Framework for Hate Speech Detection. INFORMS Journal on Computing 0(0).

https://doi.org/10.1287/ijoc.2023.0508

Semantic Aggregated Adversarial Training Framework for Hate Speech Detection
