Semantic Aggregated Adversarial Training Framework for Hate Speech Detection
Published Online: 13 May 2026, https://doi.org/10.1287/ijoc.2023.0508
References
- (2024) A survey on imbalanced learning: Latest research, applications and future directions. Artificial Intelligence Rev. 57(6):137.
- (2021) Detecting hate speech with GPT-3. Preprint, submitted March 23, https://arxiv.org/abs/2103.12407.
- (2023) Enhancing social network hate detection using back translation and GPT-3 augmentations during training and test-time. Inform. Fusion 99:101887.
- (2019) Build it break it fix it for dialogue safety: Robustness from adversarial human attack. Proc. 2019 Conf. Empirical Methods Natural Language Processing (ACL, Stroudsburg, PA), 4537–4546.
- (2023) Provable tradeoffs in adversarially robust classification. IEEE Trans. Inform. Theory 69(12):7793–7822.
- (2021) Latent Hatred: A benchmark for understanding implicit hate speech. Proc. 2021 Conf. Empirical Methods Natural Language Processing (ACL, Stroudsburg, PA), 345–363.
- (2018) A survey on automatic detection of hate speech in text. ACM Comput. Surveys 51(4):85.
- (2015) A lexicon-based approach for hate speech detection. Internat. J. Multimedia Ubiquitous Engrg. 10(4):215–230.
- (2015) Explaining and harnessing adversarial examples. Proc. 3rd Internat. Conf. Learn. Representations (ICLR, Appleton, WI).
- (2023) A survey of adversarial defenses and robustness in NLP. ACM Comput. Surveys 55(14S):332.
- (2022) HateVersarial: Adversarial attack against hate speech detection algorithms on Twitter. Proc. 30th ACM Conf. User Model. Adaptation Personalization (ACM, New York), 143–152.
- (2025) Content moderation by LLM: From accuracy to legitimacy. Artificial Intelligence Rev. 58(10):1–32.
- (2023) Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech. Companion Proc. ACM Web Conf. (ACM, New York), 294–297.
- (2021) Offensive, aggressive, and hate speech analysis: From data-centric to human-centered approach. Inform. Processing Management 58(5):102643.
- (2020) Adversarial Vertex Mixup: Toward better adversarially robust generalization. Proc. 33rd IEEE/CVF Conf. Comput. Vision Pattern Recognition (IEEE, Piscataway, NJ), 272–281.
- (2019) TextBugger: Generating adversarial text against real-world applications. Proc. 26th Annual Network Distributed System Security Sympos. (Internet Society, San Diego).
- (2020) Joint character-level word embedding and adversarial stability training to defend adversarial text. Proc. 34th AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA), 8384–8391.
- (2018) Towards deep learning models resistant to adversarial attacks. Proc. 6th Internat. Conf. Learn. Representations (ICLR, Appleton, WI).
- (2021) HateXplain: A benchmark dataset for explainable hate speech detection. Proc. 35th AAAI Conf. Artificial Intelligence (AAAI Press, Palo Alto, CA), 14867–14875.
- (2024) BERT-based ensemble learning for multi-aspect hate speech detection. Cluster Comput. 27(1):325–339.
- (2023) Playing the part of the sharp bully: Generating adversarial examples for implicit hate speech detection. Rogers A, Boyd-Graber J, Okazaki N, eds. Findings of the Association for Computational Linguistics: ACL 2023 (Association for Computational Linguistics, Toronto), 2758–2772.
- (2022) Making adversarially-trained language models forget with model retraining: A case study on hate speech detection. Companion Proc. Web Conf. (ACM, New York), 887–893.
- (2022) Training language models to follow instructions with human feedback. Proc. 35th Internat. Conf. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 27730–27744.
- (2023) Is hate speech detection the solution the world wants? Proc. Natl. Acad. Sci. USA 120(10):e2209384120.
- (2023) Just another day on Twitter: A complete 24 hours of Twitter data. Proc. Internat. AAAI Conf. Web Social Media, vol. 17 (AAAI Press, Palo Alto, CA), 1073–1081.
- (2019) Handling imbalance issue in hate speech classification using sampling-based methods. Proc. 5th Internat. Conf. Sci. Inform. Tech. (IEEE, Piscataway, NJ), 193–198.
- (2023) On the rise of fear speech in online social media. Proc. Natl. Acad. Sci. USA 120(11):e2212270120.
- (2024) On-site deployment of LLMs. Kucharavy A, Plancherel O, Mulder V, Mermoud A, Lenders V, eds. Large Language Models in Cybersecurity (Springer, Cham, Switzerland), 205–211.
- (2023) Sociotechnical harms of algorithmic systems: Scoping a taxonomy for harm reduction. Proc. 7th AAAI/ACM Conf. AI Ethics Society (ACM, New York), 723–741.
- (2020) AI for social good: Unlocking the opportunity for positive impact. Nature Comm. 11(1):2468.
- (2024) LLMs cannot reliably identify and reason about security vulnerabilities (yet?): A comprehensive evaluation, framework, and benchmarks. 2024 IEEE Sympos. Security Privacy (IEEE, Piscataway, NJ), 862–880.
- (2024) Minimizing adversarial training samples for robust image classifiers: Analysis and adversarial example generator design. IEEE Trans. Inform. Forensics Security 19:9613–9628.
- (2023) A mix-up strategy to enhance adversarial training with imbalanced data. Proc. 32nd ACM Internat. Conf. Inform. Knowledge Management (ACM, New York), 2637–2645.
- (2022) Imbalanced adversarial training with reweighting. Proc. 22nd IEEE Internat. Conf. Data Mining (IEEE, Piscataway, NJ), 1209–1214.
- (2016) A discriminative feature learning approach for deep face recognition. Eur. Conf. Comput. Vision (Springer, Berlin), 499–515.
- (2023) Black-box attack-based security evaluation framework for credit card fraud detection models. INFORMS J. Comput. 35(5):986–1001.
- (2016) Hierarchical attention networks for document classification. Proc. 2016 Conf. North Amer. Chapter Assoc. Comput. Linguistics Human Language Tech. (ACL, Stroudsburg, PA), 1480–1489.
- (2024) Revisiting out-of-distribution robustness in NLP: Benchmarks, analysis, and LLMs evaluations. Proc. 36th Internat. Conf. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1–30.
- (2019) Hate speech detection: A solved problem? The challenging case of long tail on Twitter. Semantic Web 10(5):925–945.
- (2021) Adversarial perturbation defense on deep neural networks. ACM Comput. Surveys 54(8):159.
- (2026) Semantic aggregated adversarial training framework for hate speech detection. https://doi.org/10.1287/ijoc.2023.0508.cd, https://github.com/INFORMSJoC/2023.0508.
- (2019) Theoretically principled trade-off between robustness and accuracy. Proc. 36th Internat. Conf. Machine Learn. (PMLR, New York), 7472–7482.
- (2020) Geometry-aware instance-reweighted adversarial training. Proc. 8th Internat. Conf. Learn. Representations (ICLR, Appleton, WI).
- (2020) FreeLB: Enhanced adversarial training for natural language understanding. Proc. 8th Internat. Conf. Learn. Representations (ICLR, Appleton, WI).

