
Inverse Language Modeling Towards Robust and Grounded LLMs / Gabrielli, Davide; Sestito, Simone; Masi, Iacopo. - (2026). (AAAI 2026, Singapore).

Inverse Language Modeling Towards Robust and Grounded LLMs

Davide Gabrielli; Simone Sestito; Iacopo Masi
2026

Abstract

Interpretability and robustness remain major challenges for modern Large Language Models, especially in settings where conventional evaluation or auditing tools are limited. To address this, we propose Inverse Language Modeling (ILM), a unified training framework that jointly enhances robustness to adversarial perturbations and enables a novel form of gradient-based interpretability. Rather than reconstructing exact input prompts, ILM encourages LLMs to develop gradient-aligned internal representations that allow the model to approximate plausible input patterns underlying a given output. This approximate inversion provides a new mechanism for analyzing model behavior, identifying potential triggers for unsafe generations, and providing a diagnostic signal that may support future auditing workflows. Our results show that ILM can simultaneously improve robustness and produce meaningful inversion signals, laying a foundation for LLMs that are not only more resilient, but also more transparent and analyzable. Code available at https://github.com/davegabe/pag-llm.
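The abstract describes recovering plausible inputs from a given output via gradients. As a minimal, hypothetical illustration of that general idea (not the paper's actual method, whose details are not given here), the sketch below inverts a toy linear next-token model by gradient ascent over a softmax-relaxed input token: starting from a uniform soft input, we maximize the log-probability the model assigns to a chosen target output, and read off which input token the relaxed distribution concentrates on. The model `W`, vocabulary size, and learning rate are all illustrative assumptions.

```python
import numpy as np

# Hypothetical toy "language model": next-token logits are a linear
# function of a soft (relaxed) one-hot input token. Near-diagonal W
# means input token i mostly predicts output token i, so inversion
# has a known ground truth we can check.
rng = np.random.default_rng(0)
V = 8                                    # toy vocabulary size
W = 5.0 * np.eye(V) + 0.1 * rng.normal(size=(V, V))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def invert(target, steps=400, lr=0.5):
    """Gradient-ascent inversion: find a soft input distribution whose
    predicted next token concentrates on `target`."""
    a = np.zeros(V)                      # unconstrained logits of the soft input
    for _ in range(steps):
        p_in = softmax(a)                # relaxed one-hot input token
        p_out = softmax(p_in @ W)        # model's next-token distribution
        # d log p_out[target] / d output-logits = onehot(target) - p_out
        g_logits = -p_out.copy()
        g_logits[target] += 1.0
        g_pin = W @ g_logits             # back through the linear map
        # back through the input softmax (Jacobian-vector product)
        g_a = p_in * (g_pin - (p_in * g_pin).sum())
        a += lr * g_a                    # ascend on the target's log-prob
    return softmax(a)

target = 3
p_in = invert(target)
recovered = int(p_in.argmax())           # most plausible input token
assert recovered == target                                  # input recovered
assert int(softmax(p_in @ W).argmax()) == target            # and it drives the output
```

Real LLM inversion operates over sequences of embeddings and a deep network rather than a single linear map, but the core loop is the same: backpropagate from a desired output to a relaxed input and inspect where the optimization concentrates mass.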
AAAI 2026
LLM · Invertibility · Adversarial Training · Gradients · Robustness
04 Conference proceedings publication::04b Conference paper in volume
Files attached to this item
There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1763250
Warning: the displayed data have not been validated by the university.
