Research products catalogue

Multi-Scale Self-Supervised Learning for Efficient Audio Deepfake Detection / Wani, Taiba Majid; Amerini, Irene. - In: IEEE SIGNAL PROCESSING LETTERS. - ISSN 1070-9908. - 33:(2026), pp. 46-50. [10.1109/lsp.2025.3634032]

Multi-Scale Self-Supervised Learning for Efficient Audio Deepfake Detection

Wani, Taiba Majid; Amerini, Irene

2026

Abstract

Neural vocoders enable highly realistic synthetic speech that challenges multimedia authentication; however, existing detection approaches suffer from limited robustness to unseen synthesis methods and inadequate deployment readiness. We propose MASD (Multi-scale Artifact-aware Self-supervised Deepfake detector), which combines multi-scale self-supervised learning (SSL) with handcrafted features. MASD decomposes spectrograms into three frequency bands, which are processed by an encoder pretrained using masked reconstruction, contrastive predictive coding, and adversarial vocoder classification. The encoder features are fused with phase coherence, spectral flux, and high-frequency energy through cross-attention and classified by a temperature-scaled SVM. Evaluation on ASVspoof 2019 LA demonstrates state-of-the-art performance. Ablation studies confirm adversarial augmentation as the primary driver of robustness, improving the equal error rate (EER) from 1.52% to 0.39%, while zero-shot cross-dataset evaluation validates generalization, establishing MASD as a practical solution.
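
As a rough illustration of the band decomposition and two of the handcrafted features named in the abstract, the following is a minimal sketch and not the authors' implementation. The abstract does not give band edges, feature definitions, or a cutoff frequency, so the equal-width band split, the L2 definition of spectral flux, the energy-ratio definition of high-frequency energy, and all function and parameter names (`split_bands`, `spectral_flux`, `high_frequency_energy_ratio`, `cutoff_bin`) are assumptions based on common textbook formulations.

```python
from typing import List

# Assumed layout: spec[t][f] is the magnitude at time frame t, frequency bin f.
Spectrogram = List[List[float]]


def split_bands(spec: Spectrogram, n_bands: int = 3) -> List[Spectrogram]:
    """Split every frame into n_bands contiguous, near-equal-width bands.

    Equal-width edges are an assumption; the abstract only says "three bands".
    """
    n_bins = len(spec[0])
    edges = [i * n_bins // n_bands for i in range(n_bands + 1)]
    return [[frame[lo:hi] for frame in spec] for lo, hi in zip(edges, edges[1:])]


def spectral_flux(spec: Spectrogram) -> List[float]:
    """L2 norm of the magnitude change between each pair of consecutive frames."""
    return [
        sum((cur_m - prev_m) ** 2 for prev_m, cur_m in zip(prev, cur)) ** 0.5
        for prev, cur in zip(spec, spec[1:])
    ]


def high_frequency_energy_ratio(spec: Spectrogram, cutoff_bin: int) -> float:
    """Fraction of total energy (squared magnitude) at or above cutoff_bin."""
    total = sum(m * m for frame in spec for m in frame)
    high = sum(m * m for frame in spec for m in frame[cutoff_bin:])
    return high / total if total > 0 else 0.0


if __name__ == "__main__":
    # Toy two-frame, six-bin spectrogram (made-up values for illustration only).
    toy = [
        [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0, 0.0, 2.0],
    ]
    low, mid, high = split_bands(toy)
    print("band widths:", len(low[0]), len(mid[0]), len(high[0]))
    print("spectral flux:", spectral_flux(toy))
    print("HF energy ratio:", high_frequency_energy_ratio(toy, cutoff_bin=4))
```

In a pipeline like the one the abstract describes, each band would be passed to the pretrained encoder, and scalar descriptors such as these would be fused with the encoder output; the cross-attention fusion and the SVM are not sketched here.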

	Publication year
	
				2026
			
	Keywords
	
				adversarial augmentation; audio deepfake detection; confidence calibration; self-supervised learning
			
	Type
	
				01 Journal publication::01a Journal article
			
	Citation
	
				Multi-Scale Self-Supervised Learning for Efficient Audio Deepfake Detection / Wani, Taiba Majid; Amerini, Irene. - In: IEEE SIGNAL PROCESSING LETTERS. - ISSN 1070-9908. - 33:(2026), pp. 46-50. [10.1109/lsp.2025.3634032]
			
	Belongs to type:
	
				01a Journal article

Files attached to this product

File	Size	Format
Wani_Multi-Scale_2026.pdf (access: archive managers only; type: publisher's version, published with the publisher's layout; license: All rights reserved)	1.04 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1765779
