Explainability-driven adversarial robustness assessment for generalized deepfake detectors

Cirillo, Lorenzo; Amerini, Irene
2025

Abstract

The ability of generative models to produce high-quality fake images requires deepfake detectors to be accurate and to generalize well. Moreover, the explainability and adversarial robustness of deepfake detectors are critical for deploying such models in real-world scenarios. In this paper, we propose a framework that leverages explainability to assess the adversarial robustness of deepfake detectors. Specifically, we apply feature attribution methods to identify the image regions on which the model focuses to make its prediction. We then use the generated heatmaps to perform an explainability-driven attack, perturbing the most and the least relevant regions with gradient-based adversarial techniques. We feed the model with the resulting adversarial images and measure the accuracy drop and the attack success rate. We test our methodology on state-of-the-art models with strong generalization abilities, providing a comprehensive, explainability-driven evaluation of their robustness. Experimental results show that the explainability analysis serves as a tool to reveal vulnerabilities of generalized deepfake detectors to adversarial attacks.
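The record itself contains no code, but as a rough illustration of the pipeline the abstract describes, here is a minimal, hypothetical PyTorch sketch. It uses plain input-gradient saliency as a stand-in for the paper's feature-attribution methods and FGSM as the gradient-based attack; the function names (explainability_driven_fgsm, attack_metrics), the epsilon budget, and the top_frac threshold are all assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def explainability_driven_fgsm(model, images, labels, epsilon=8/255, top_frac=0.1):
    """FGSM perturbation restricted to the most attribution-relevant pixels.

    Input-gradient saliency is a stand-in here; the paper evaluates
    dedicated feature-attribution methods, so treat this as a sketch.
    """
    images = images.clone().detach().requires_grad_(True)
    logits = model(images)
    loss = F.cross_entropy(logits, labels)
    grad, = torch.autograd.grad(loss, images)

    # Saliency heatmap: per-pixel attribution magnitude, max over channels.
    heatmap = grad.abs().amax(dim=1, keepdim=True)          # (B, 1, H, W)

    # Binary mask selecting the top `top_frac` most relevant pixels per image.
    b = heatmap.size(0)
    flat = heatmap.view(b, -1)
    k = max(1, int(top_frac * flat.size(1)))
    thresh = flat.topk(k, dim=1).values[:, -1].view(b, 1, 1, 1)
    mask = (heatmap >= thresh).float()
    # Use (1 - mask) instead to perturb the least relevant regions.

    adv = images + epsilon * grad.sign() * mask
    return adv.clamp(0, 1).detach()

@torch.no_grad()
def attack_metrics(model, images, adv_images, labels):
    """Accuracy drop and attack success rate, as described in the abstract."""
    clean_pred = model(images).argmax(1)
    adv_pred = model(adv_images).argmax(1)
    clean_acc = (clean_pred == labels).float().mean().item()
    adv_acc = (adv_pred == labels).float().mean().item()
    # Success rate: fraction of correctly classified inputs that flip label.
    correct = clean_pred == labels
    flipped = correct & (adv_pred != labels)
    asr = flipped.float().sum().item() / max(1, correct.sum().item())
    return clean_acc, adv_acc - clean_acc, asr
```

A typical evaluation loop would call explainability_driven_fgsm on each test batch and aggregate the accuracy drop and attack success rate over the dataset, once with the relevant-region mask and once with its complement to compare the two attack variants.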
Deepfake detection; Model explainability; Adversarial robustness; Generalized deepfake detectors; Explainability-driven attack
01 Journal publication::01a Journal article
Explainability-driven adversarial robustness assessment for generalized deepfake detectors / Cirillo, Lorenzo; Gervasio, Andrea; Amerini, Irene. - In: EURASIP JOURNAL ON INFORMATION SECURITY. - ISSN 2510-523X. - 2025:1(2025). [10.1186/s13635-025-00211-9]
Files attached to this item
File: Cirillo_Explainability-driven_2025.pdf
Access: open access
Note: DOI 10.1186/s13635-025-00211-9
Type: Publisher's version (published with the publisher's layout)
License: Creative Commons
Size: 2.91 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1744536
Citations
  • PubMed Central: n/a
  • Scopus: n/a
  • Web of Science: 0