Discrimination Bias Detection through Categorical Association in Pre-trained Language Models

Dusi, M.; Arici, N.; Gerevini, A. E.; Putelli, L.; Serina, I. (2024). Discrimination Bias Detection through Categorical Association in Pre-trained Language Models. IEEE Access, 12, pp. 162651-162667. ISSN 2169-3536. DOI: 10.1109/ACCESS.2024.3482010.
Abstract
The analysis of bias, prejudice, and unwanted discriminatory behavior in pre-trained neural language models (NLMs), given the sensitivity of the topic and its public interest, should satisfy two main criteria: intuitiveness and statistical rigor. In the current state of the art, there are two main categories of approaches for analyzing bias: those based on the models' textual output, and those based on the geometric space of the embedded representations computed by the NLMs. While the first is intuitive, this kind of analysis is often conducted on simple template sentences, which limits the validity of its conclusions in real-world contexts. In contrast, geometric methods are more rigorous but considerably more complex to implement and to understand for non-experts in Natural Language Processing (NLP). In this paper, we propose a single method for analyzing bias in pre-trained language models that combines these two aspects. Through a simple classification task, we verify whether the information contained in the embedded representation of words describing a protected property (such as religion) can be used to identify a stereotyped property (such as criminal behavior), requiring only a minimal supervised dataset. We experimentally validate our approach, finding that four widespread Transformer-based models are affected by prejudices related to gender, nationality, and religion.
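The abstract describes a classification-based probe over embedded word representations. Below is a minimal sketch of that general idea, not the authors' implementation: the bert-base-uncased checkpoint, the transformers and scikit-learn libraries, the word lists, the labels, and the logistic-regression probe are all illustrative assumptions, and the direction of the probe (which property supplies the words and which supplies the labels) is a simplification of the paper's actual setup.

```python
# Minimal, illustrative sketch of a classification probe over word embeddings.
# Assumptions NOT taken from the paper: model checkpoint, libraries, word
# lists, labels, and classifier; the authors' datasets and method may differ.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(word: str):
    """Mean contextual embedding of a single word (over its subword tokens)."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden_dim)
    return hidden[0, 1:-1].mean(dim=0).numpy()       # drop [CLS] / [SEP]

# Hypothetical minimal supervised dataset: profession words annotated with the
# gender they are stereotypically associated with (1 = female, 0 = male).
# The labels encode the stereotype being tested, not a factual claim.
words = ["nurse", "engineer", "teacher", "mechanic", "secretary", "pilot"]
labels = [1, 0, 1, 0, 1, 0]

X = [embed(w) for w in words]

# If the probe predicts the stereotype-laden label from the embeddings well
# above chance, the representations encode a categorical association between
# the protected property (gender) and the stereotyped one.
probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, labels, cv=3)
print(f"probe accuracy: {scores.mean():.2f} (chance = 0.50)")
```

In this sketch, a cross-validated accuracy clearly above chance on held-out words would indicate that the embeddings carry the categorical association, which is the kind of signal the abstract's classification task looks for; accuracy near chance would not.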
File: Dusi_Discrimination_2024.pdf
Access: open access
Note: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10719988
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 3.92 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.