Analysis and detection of social biases in deep neural language models / Dusi, Michele. - (2026 May 18).

Analysis and detection of social biases in deep neural language models

DUSI, MICHELE
18/05/2026

Abstract

The growing adoption of deep Neural Language Models (NLMs) has raised concerns about the social biases they contain and their potential societal impact. While a substantial body of work has documented biased behaviors in model outputs, the question of where such biases originate within the language modeling pipeline remains only partially understood. This thesis introduces a method for detecting and quantifying social biases in pre-trained language models that aims to balance interpretability and statistical rigor. The proposed approach tests whether the information encoded in the embedded representations of protected attributes (e.g., gender, nationality, religion) can be used to predict stereotyped attributes through a simple supervised classification task. Because it requires only a minimal labeled dataset, the method provides an accessible way to probe representational biases. Experimental results on several Transformer-based models reveal consistent associations between protected and stereotyped properties. In addition, a complementary visualization-based technique is introduced to support qualitative inspection of bias patterns. Building on this methodological framework, the thesis also conducts a systematic analysis of recent literature on bias in language models, with the aim of exploring the contexts and mechanisms through which the phenomenon manifests. A central motivation is to move beyond the common assumption that bias is solely a reflection of training data; the analysis is therefore organized around four complementary perspectives. First, it examines the role of training and fine-tuning corpora in shaping measurable bias. Second, it analyzes how bias can emerge, evolve, or be amplified during the training process itself. Third, it considers model-internal factors, investigating how biases are encoded and propagated through parameters, representations, and architectural components. Finally, it discusses evidence on model scale and complexity, assessing whether certain forms of bias appear or intensify only beyond specific thresholds of capacity or model size. Overall, this thesis provides a structured synthesis of current knowledge on the origins of bias in language models, while offering a practical tool for assessing bias empirically. The ultimate goal is to provide actionable insights that support the development of fairer and more inclusive NLP technologies.
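
The classification-based test described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the embeddings are synthetic stand-ins for model representations of protected-attribute terms, the function names (`fit_logreg`, `cv_accuracy`, `permutation_p_value`) are hypothetical, and the use of a logistic-regression probe with a label-permutation test is an assumption about one plausible way to combine a simple classifier with a significance check.

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, steps=300, l2=1e-2):
    """Train a small L2-regularised logistic-regression probe by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def cv_accuracy(X, y, folds=5, seed=0):
    """Held-out accuracy of the probe, estimated with k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(len(y))
    parts = np.array_split(idx, folds)
    correct = 0
    for k in range(folds):
        test = parts[k]
        train = np.concatenate([parts[j] for j in range(folds) if j != k])
        w, b = fit_logreg(X[train], y[train])
        pred = (X[test] @ w + b > 0).astype(int)
        correct += int(np.sum(pred == y[test]))
    return correct / len(y)

def permutation_p_value(X, y, n_perm=50, seed=0):
    """Compare the observed accuracy with accuracies obtained on shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = cv_accuracy(X, y)
    null = [cv_accuracy(X, rng.permutation(y)) for _ in range(n_perm)]
    p = (1 + sum(a >= observed for a in null)) / (1 + n_perm)
    return observed, p

# Synthetic stand-in data (assumption): 200 "embeddings" of dimension 16,
# with a binary stereotyped-attribute label planted along one dimension.
rng = np.random.default_rng(42)
n, d = 200, 16
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 2.0 * (2 * y - 1)

observed, p_value = permutation_p_value(X, y)
print(f"probe accuracy = {observed:.2f}, permutation p-value = {p_value:.3f}")
```

In this sketch, a probe accuracy well above the shuffled-label baseline would suggest that the representations carry information associated with the stereotyped attribute. With real models, the synthetic matrix `X` would be replaced by embeddings extracted from the model under study.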
18 May 2026
Gerevini, Alfonso Emilio; Serina, Ivan; Putelli, Luca
Files attached to this record
File: Tesi_dottorato_Dusi.pdf
Access: open access
Note: complete thesis
Type: Doctoral thesis
License: Creative Commons
Size: 3.15 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1768249