Semantic Aggregated Adversarial Training Framework for Hate Speech Detection

Published Online: https://doi.org/10.1287/ijoc.2023.0508

References

  • Zhang X, Tian H, Zheng X, Peng J, Zeng DD (2026) Semantic aggregated adversarial training framework for hate speech detection. https://doi.org/10.1287/ijoc.2023.0508.cd, https://github.com/INFORMSJoC/2023.0508.