Abstract. Interpretability and robustness remain major challenges for modern Large Language Models, especially in settings where conventional evaluation or auditing tools are limited. To address this, we propose Inverse Language Modeling (ILM), a unified training framework that jointly enhances robustness to adversarial perturbations and enables a novel form of gradient-based interpretability. Rather than reconstructing exact input prompts, ILM encourages LLMs to develop gradient-aligned internal representations that allow the model to approximate plausible input patterns underlying a given output. This approximate inversion provides a new mechanism for analyzing model behavior, identifying potential triggers for unsafe generations, and providing a diagnostic signal that may support future auditing workflows. Our results show that ILM can simultaneously improve robustness and produce meaningful inversion signals, laying a foundation for LLMs that are not only more resilient, but also more transparent and analyzable. Code available at https://github.com/davegabe/pag-llm.
Gabrielli, Davide; Sestito, Simone; Masi, Iacopo. Inverse Language Modeling: Towards Robust and Grounded LLMs. (2026). AAAI 2026, Singapore.
Inverse Language Modeling: Towards Robust and Grounded LLMs
Davide Gabrielli; Simone Sestito; Iacopo Masi
2026
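The abstract describes recovering plausible inputs from a given output via gradients. The paper's actual ILM training objective is not shown here; as a minimal sketch of the general idea of gradient-based input inversion, the toy below treats a frozen linear map as the "model" and gradient-descends on a candidate input until its output matches an observed target. All names and the linear-model assumption are illustrative, not the authors' method.

```python
import numpy as np

# Toy gradient-based input inversion (illustrative only, NOT the ILM objective):
# given a frozen "model" f(x) = W @ x and an observed output y, recover an
# input x_hat whose output reproduces y by minimizing 0.5 * ||W x - y||^2.

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))      # frozen "model" weights (hypothetical)
x_true = rng.normal(size=4)      # hidden input we try to approximate
y = W @ x_true                   # observed output

x_hat = np.zeros(4)              # start inversion from a blank input
lr = 0.02
for _ in range(3000):
    residual = W @ x_hat - y     # gradient of the loss w.r.t. x is W.T @ residual
    x_hat -= lr * (W.T @ residual)

print(np.allclose(W @ x_hat, y, atol=1e-4))  # inverted input reproduces the output
```

For a real LLM the same loop would run over continuous input embeddings rather than a small vector, and the recovered point is only an approximation of plausible inputs, which is exactly the framing the abstract uses.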


