Neural vocoders enable highly realistic synthetic speech that challenges multimedia authentication; however, existing detection approaches suffer from limited robustness to unseen synthesis methods and inadequate deployment readiness. We propose MASD (Multi-scale Artifact-aware Self-supervised Deepfake detector), which combines multi-scale self-supervised learning (SSL) with handcrafted features. MASD decomposes spectrograms into three frequency bands, which are processed by an encoder pretrained with masked reconstruction, contrastive predictive coding, and adversarial vocoder classification. The resulting features are fused with phase coherence, spectral flux, and high-frequency energy through cross-attention and classified by a temperature-scaled SVM. Evaluation on ASVspoof 2019 LA demonstrates state-of-the-art performance. Ablation studies confirm adversarial augmentation as the primary driver of robustness, improving the equal error rate (EER) from 1.52% to 0.39%, while zero-shot cross-dataset evaluation validates generalization, establishing MASD as a practical solution.
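The band decomposition and two of the handcrafted features named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the equal-width three-way split, the STFT settings (`n_fft`, `hop`), the 4 kHz cutoff, and all function names are assumptions chosen for illustration, and phase coherence, the SSL encoder, and the cross-attention fusion are not covered.

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=512, hop=128):
    """Return a (frames, n_fft // 2 + 1) magnitude spectrogram (Hann-windowed STFT)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def split_bands(spec, n_bands=3):
    """Split the frequency axis into contiguous bands.

    Assumption: near-equal-width bands; the abstract does not state the boundaries.
    """
    return np.array_split(spec, n_bands, axis=1)

def spectral_flux(spec):
    """Per-frame L2 norm of the change between consecutive magnitude frames."""
    return np.sqrt((np.diff(spec, axis=0) ** 2).sum(axis=1))

def high_frequency_energy_ratio(spec, sample_rate, cutoff_hz=4000.0):
    """Fraction of total spectral energy at or above cutoff_hz (cutoff is an assumption)."""
    freqs = np.linspace(0.0, sample_rate / 2.0, spec.shape[1])
    energy = spec ** 2
    return energy[:, freqs >= cutoff_hz].sum() / energy.sum()

# Usage on a synthetic 440 Hz tone sampled at 16 kHz (one second).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

spec = magnitude_spectrogram(tone)
low, mid, high = split_bands(spec)
flux = spectral_flux(spec)
hf_ratio = high_frequency_energy_ratio(spec, sample_rate)
```

In a pipeline like the one the abstract outlines, each band would be passed to the pretrained encoder, while scalar or per-frame descriptors such as `flux` and `hf_ratio` would form the handcrafted side of the fusion.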

Multi-Scale Self-Supervised Learning for Efficient Audio Deepfake Detection / Wani, Taiba Majid; Amerini, Irene. - In: IEEE SIGNAL PROCESSING LETTERS. - ISSN 1070-9908. - 33:(2026), pp. 46-50. [10.1109/lsp.2025.3634032]

Multi-Scale Self-Supervised Learning for Efficient Audio Deepfake Detection

Wani, Taiba Majid; Amerini, Irene
2026

adversarial augmentation; audio deepfake detection; confidence calibration; self-supervised learning
01 Journal publication::01a Journal article
Files attached to this item
File Size Format
Wani_Multi-Scale_2026.pdf

archive managers only

Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 1.04 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1765779