The perils of stealthy data poisoning attacks in misogynistic content moderation

Enneifer, Syrine (first author; Methodology)
Baccini, Federica (Investigation)
Siciliano, Federico (Investigation)
Amerini, Irene (Supervision)
Silvestri, Fabrizio (Supervision)
2025

Abstract

Moderating harmful content, such as misogynistic language, is essential to ensuring safety and well-being in online spaces. To this end, text classification models are used to detect toxic content, especially in communities known to promote real-world violence and radicalization, such as the incel movement. However, these models remain vulnerable to targeted data poisoning attacks. In this work, we present a realistic targeted poisoning strategy in which an adversary aims to have specific misogynistic comments misclassified so that they evade detection. While prior approaches craft poisoned samples with explicit trigger phrases, our method relies exclusively on existing training data. In particular, we repurpose the concept of opponents, i.e., training points that negatively influence the prediction of a target test point, to identify poisoned points to be added to the training set, either in their original form or as paraphrased variants. The effectiveness of the attack is then measured along several dimensions: success rate, number of poisoned samples required, and preservation of overall model performance. Our results on two different datasets show that a small fraction of malicious inputs can be sufficient to undermine the classification of a target sample, while leaving model performance on non-target points virtually unaffected, which reveals the stealthy nature of the attack. Finally, we show that the attack transfers across different models, highlighting its practical relevance in real-world scenarios. Overall, our work raises awareness of the vulnerability of powerful machine learning models to data poisoning attacks and may encourage the development of efficient defense and mitigation techniques to strengthen the security of automated moderation systems.
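To make the notion of opponents concrete, the following is a minimal sketch of TracIn-style influence scoring. It is not the authors' implementation. It assumes a plain logistic regression model, two hand-picked weight checkpoints, and a toy two-feature dataset. All function names (`tracin_score`, `top_opponents`) and all values are hypothetical. A training point with a negative score pushes the model away from the correct prediction on the target and is treated here as an opponent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient(weights, x, y):
    """Gradient of the logistic loss w.r.t. the weights for one example, y in {0, 1}."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    return [(p - y) * xi for xi in x]

def tracin_score(checkpoints, learning_rates, train_point, test_point):
    """Sum over checkpoints of lr * <grad(train_point), grad(test_point)>."""
    score = 0.0
    for weights, lr in zip(checkpoints, learning_rates):
        g_train = loss_gradient(weights, *train_point)
        g_test = loss_gradient(weights, *test_point)
        score += lr * sum(a * b for a, b in zip(g_train, g_test))
    return score

def top_opponents(checkpoints, learning_rates, train_set, target, k):
    """Indices of the k training points with the most negative influence on the target."""
    scored = [
        (tracin_score(checkpoints, learning_rates, z, target), i)
        for i, z in enumerate(train_set)
    ]
    negatives = sorted((s, i) for s, i in scored if s < 0)
    return [i for _, i in negatives[:k]]

# Toy data (assumed): features are (x1, x2), label 1 = "misogynistic", 0 = "benign".
train_set = [
    ([1.0, 0.0], 1),
    ([0.9, 0.1], 0),  # resembles the target but carries the opposite label
    ([0.0, 1.0], 0),
    ([1.0, 0.2], 1),
]
target = ([1.0, 0.1], 1)

# Hypothetical checkpoints saved during training, with their learning rates.
checkpoints = [[0.0, 0.0], [0.5, -0.5]]
learning_rates = [0.1, 0.1]

opponent_ids = top_opponents(checkpoints, learning_rates, train_set, target, k=2)
print(opponent_ids)
```

In this toy setting, the strongest opponent is the training point that lies close to the target in feature space but has the opposite label. Under the strategy described in the abstract, such points (or paraphrases of them) would be the candidates added to the training set.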
Data poisoning; Generative AI; Misogyny detection; Natural language processing; Online harms; Text classification; TracIn
01 Journal publication::01a Journal article
The perils of stealthy data poisoning attacks in misogynistic content moderation / Enneifer, Syrine; Baccini, Federica; Siciliano, Federico; Amerini, Irene; Silvestri, Fabrizio. - In: ONLINE SOCIAL NETWORKS AND MEDIA. - ISSN 2468-6964. - 50:(2025). [10.1016/j.osnem.2025.100334]
Files attached to this item
File: Enneifer_The-perils_2025.pdf
Access: open access
Note: https://doi.org/10.1016/j.osnem.2025.100334
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 4 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1750953