
Analysis and detection of social biases in deep neural language models / Dusi, Michele. - (2026 May 18).

Analysis and detection of social biases in deep neural language models

DUSI, MICHELE

18/05/2026

Abstract

The growing adoption of deep Neural Language Models (NLMs) has raised concerns about the social biases they encode and their potential societal impact. While a substantial body of work has documented biased behavior in model outputs, where such biases originate within the language modeling pipeline remains only partially understood. This thesis introduces a method for detecting and quantifying social biases in pre-trained language models that aims to balance interpretability and statistical rigor. The proposed approach tests whether the information encoded in the embedded representations of protected attributes (e.g., gender, nationality, religion) can be used to predict stereotyped attributes through a simple supervised classification task. Because it requires only a minimal labeled dataset, the method provides an accessible way to probe representational biases. Experimental results on several Transformer-based models reveal consistent associations between protected and stereotyped properties. In addition, a complementary visualization-based technique is introduced to support qualitative inspection of bias patterns.

Building on this methodological framework, the thesis also conducts a systematic analysis of recent literature on bias in language models, with the aim of exploring the context and mechanisms through which the phenomenon manifests. A central motivation is to move beyond the common assumption that bias is solely a reflection of training data; the analysis is therefore organized around four complementary perspectives. First, it examines the role of training and fine-tuning corpora in shaping measurable bias. Second, it analyzes how bias can emerge, evolve, or be amplified during the training process itself. Third, it considers model-internal factors, investigating how biases are encoded and propagated through parameters, representations, and architectural components. Finally, it discusses evidence on model scale and complexity, assessing whether certain forms of bias appear or intensify only beyond specific thresholds of capacity or model size.

Overall, this thesis provides a structured synthesis of current knowledge on the origins of bias in language models, while offering a practical tool for assessing such bias empirically. The ultimate goal is to provide actionable insights that support the development of fairer and more inclusive NLP technologies.
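The probing idea described in the abstract (predicting a stereotyped attribute from the embedding of a protected attribute with a simple supervised classifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration only: the vectors are synthetic, hand-made stand-ins for model embeddings, the labels are hypothetical, and the nearest-centroid classifier with leave-one-out evaluation is a stand-in rather than the classifier or protocol used in the thesis.

```python
from statistics import fmean


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(component) for component in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_probe_accuracy(embeddings, labels):
    """Leave-one-out accuracy of a nearest-centroid probe (illustrative stand-in)."""
    correct = 0
    for i, (vec, gold) in enumerate(zip(embeddings, labels)):
        # Build one centroid per class from every example except the held-out one.
        by_class = {}
        for j, (other, label) in enumerate(zip(embeddings, labels)):
            if j != i:
                by_class.setdefault(label, []).append(other)
        centroids = {label: centroid(vs) for label, vs in by_class.items()}
        # Predict the class whose centroid is closest to the held-out vector.
        predicted = min(
            centroids, key=lambda label: squared_distance(vec, centroids[label])
        )
        correct += predicted == gold
    return correct / len(labels)


# Synthetic 3-d vectors standing in for embeddings of protected-attribute terms
# (hypothetical data, not taken from any model or from the thesis).
toy_embeddings = [
    [0.9, 0.1, 0.3], [0.8, 0.2, 0.1], [1.0, 0.0, 0.2], [0.7, 0.3, 0.4],
    [0.1, 0.9, 0.3], [0.2, 0.8, 0.1], [0.0, 1.0, 0.2], [0.3, 0.7, 0.4],
]
# Hypothetical binary labels for a stereotyped attribute.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

accuracy = loo_probe_accuracy(toy_embeddings, toy_labels)
# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(toy_labels.count(c) for c in set(toy_labels)) / len(toy_labels)
print(f"probe accuracy: {accuracy:.2f}, majority baseline: {majority_baseline:.2f}")
```

In a sketch like this, held-out accuracy well above the majority baseline would suggest that the representations carry information predictive of the stereotyped attribute. With real embeddings, a significance test against shuffled labels would be a natural addition; that step is not shown here.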


Defense date: 18 May 2026
External tutors: Gerevini, Alfonso Emilio; Serina, Ivan; Putelli, Luca
Type: 07a Doctoral thesis

Files attached to this product

File	Size	Format
Tesi_dottorato_Dusi.pdf (open access; note: full thesis; type: doctoral thesis; license: Creative Commons)	3.15 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1768249
