Discrimination Bias Detection through Categorical Association in Pre-trained Language Models

Dusi M. (first author); Gerevini A. E.
2024

Abstract

The analysis of the presence of bias, prejudices and unwanted discriminatory behavior in pre-trained neural language models (NLMs), given the sensitivity of the topic and its public interest, should satisfy two main criteria: intuitiveness and statistical rigor. In the current state of the art, there are two main categories of approaches for analyzing bias: those based on the models' textual output, and those based on the geometric space of the embedded representations computed by the NLMs. While the former is intuitive, this kind of analysis is often conducted on simple template sentences, which limits the validity of its conclusions in real-world contexts. Conversely, geometric methods are more rigorous but considerably more complex to implement and interpret for those who are not experts in Natural Language Processing (NLP). In this paper, we propose a method for analyzing bias in pre-trained language models that combines these two aspects. Through a simple classification task, we verify whether the information contained in the embedded representations of words describing a protected property (such as religion) can be used to identify a stereotyped property (such as criminal behavior), requiring only a minimal supervised dataset. We experimentally validate our approach, finding that four widespread Transformer-based models are affected by prejudices of gender, nationality, and religion.
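As a rough illustration of the kind of classification-based probe the abstract describes, the sketch below trains a simple classifier on contextual word embeddings for a stereotyped property and then applies it to words naming a protected property. The model name (bert-base-uncased), the word lists, the labels, and the logistic-regression probe are illustrative assumptions, not the authors' actual experimental setup.

# Minimal illustrative sketch (not the authors' exact pipeline): train a
# simple probe on a small labelled vocabulary for a stereotyped property
# (e.g., criminal behavior) using contextual embeddings, then check how
# the probe classifies words naming a protected property (e.g., religions).
# Model name, word lists and labels are hypothetical placeholders.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(word: str) -> torch.Tensor:
    """Mean-pool the last-layer hidden states of the word's subtokens."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden[0, 1:-1].mean(dim=0)              # drop [CLS] and [SEP]

# Minimal supervised dataset for the stereotyped property (illustrative).
train_words = ["criminal", "thief", "violent", "honest", "teacher", "doctor"]
train_labels = [1, 1, 1, 0, 0, 0]   # 1 = stereotype-associated, 0 = neutral

X_train = torch.stack([embed(w) for w in train_words]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Words describing the protected property: if the probe systematically
# assigns them to the stereotyped class, the embeddings encode a
# discriminatory association between the two properties.
protected_words = ["christian", "muslim", "jewish", "buddhist", "hindu"]
X_prot = torch.stack([embed(w) for w in protected_words]).numpy()
for word, prob in zip(protected_words, probe.predict_proba(X_prot)[:, 1]):
    print(f"{word:>10s}: P(stereotyped class) = {prob:.2f}")

Under this reading, a probe that separates the stereotyped property well above chance, and that assigns some protected-group words to the stereotyped class much more often than others, would signal the kind of categorical association the paper investigates.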
2024
AI Fairness; Bias Detection; Contextual Word Embedding; Ethics of AI; Language Models; Natural Language Processing
01 Journal publication::01a Journal article
Discrimination Bias Detection through Categorical Association in Pre-trained Language Models / Dusi, M.; Arici, N.; Gerevini, A. E.; Putelli, L.; Serina, I.. - In: IEEE ACCESS. - ISSN 2169-3536. - 12:(2024), pp. 162651-162667. [10.1109/ACCESS.2024.3482010]
Files attached to this product

File: Dusi_Discrimination_2024.pdf
Access: open access
Note: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10719988
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 3.92 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1725661
Citations
  • PMC: ND
  • Scopus: 0
  • Web of Science: 0