Catalogo dei prodotti della ricerca

Despite the rapid progress in multimodal models and Large Visual-Language Models (LVLM), they remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models that are robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference. In light of this, we propose Energy-Guided Test-Time Transformation (ET3), a lightweight, training-free defense that enhances the robustness by minimizing the energy of the input samples. Our method is grounded in a theory that proves our transformation succeeds in classification under reasonable assumptions. We present extensive experiments demonstrating that ET3 provides a strong defense for classifiers, zero-shot classification with CLIP, and also for boosting the robustness of LVLMs in tasks such as Image Captioning and Visual Question Answering.

A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models / Mirza, Mujtaba Hussain; D'Orazio, Antonio; Melamed, Odelia; Masi, Iacopo. - (2026). ( IEEE/CVF Computer Vision and Pattern Recognition Denver, USA ).

A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models

Mujtaba Hussain Mirza;Antonio D'Orazio;Odelia Melamed;Iacopo Masi

2026

Abstract

Despite the rapid progress in multimodal models and Large Visual-Language Models (LVLM), they remain highly susceptible to adversarial perturbations, raising serious concerns about their reliability in real-world use. While adversarial training has become the leading paradigm for building models that are robust to adversarial attacks, Test-Time Transformations (TTT) have emerged as a promising strategy to boost robustness at inference. In light of this, we propose Energy-Guided Test-Time Transformation (ET3), a lightweight, training-free defense that enhances the robustness by minimizing the energy of the input samples. Our method is grounded in a theory that proves our transformation succeeds in classification under reasonable assumptions. We present extensive experiments demonstrating that ET3 provides a strong defense for classifiers, zero-shot classification with CLIP, and also for boosting the robustness of LVLMs in tasks such as Image Captioning and Visual Question Answering.

Scheda breve

Scheda completa

	Anno di pubblicazione
	
				2026
			
	Nome convegno
	
				IEEE/CVF Computer Vision and Pattern Recognition
			
	Parole chiave
	
				test-time-transformation, robustness, LLM, EBM
			
	Tipologia
	
				04 Pubblicazione in atti di convegno::04b Atto di convegno in volume
			
	Citazione
	
				A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models / Mirza, Mujtaba Hussain; D'Orazio, Antonio; Melamed, Odelia; Masi, Iacopo. - (2026). ( IEEE/CVF Computer Vision and Pattern Recognition Denver, USA ).

File allegati a questo prodotto

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11573/1763260

Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni

ND

ND

ND

social impact