Multi-Scale Self-Supervised Learning for Efficient Audio Deepfake Detection / Wani, Taiba Majid; Amerini, Irene. - In: IEEE SIGNAL PROCESSING LETTERS. - ISSN 1070-9908. - 33:(2026), pp. 46-50. [10.1109/lsp.2025.3634032]
Multi-Scale Self-Supervised Learning for Efficient Audio Deepfake Detection
Wani, Taiba Majid; Amerini, Irene
2026
Abstract
Neural vocoders enable highly realistic synthetic speech that challenges multimedia authentication; however, existing detection approaches suffer from limited robustness to unseen synthesis methods and inadequate deployment readiness. We propose MASD (Multi-scale Artifact-aware Self-supervised Deepfake detector), which combines multi-scale self-supervised learning (SSL) with handcrafted features. MASD decomposes spectrograms into three frequency bands, which are processed by an encoder pretrained with masked reconstruction, contrastive predictive coding, and adversarial vocoder classification. The encoder features are fused with phase coherence, spectral flux, and high-frequency energy through cross-attention and classified by a temperature-scaled SVM. Evaluation on ASVspoof 2019 LA demonstrates state-of-the-art performance. Ablation studies confirm adversarial augmentation as the primary driver of robustness, improving the equal error rate (EER) from 1.52% to 0.39%, while zero-shot cross-dataset evaluation validates generalization, establishing MASD as a practical solution.

| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| Wani_Multi-Scale_2026.pdf | Archive managers only | Publisher's version (published version with the publisher's layout) | All rights reserved | 1.04 MB | Adobe PDF |
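
As a rough illustration of two ideas named in the abstract, the sketch below splits a magnitude spectrogram into three frequency bands and computes two of the handcrafted descriptors (spectral flux and high-frequency energy). It is not the authors' implementation: the abstract does not specify band edges, window settings, or feature definitions, so the equal-width bands, the 512-point Hann STFT, the 4 kHz cutoff, and all function names here are assumptions chosen for illustration. Phase coherence and the learned encoder are omitted.

```python
import numpy as np


def magnitude_spectrogram(signal, n_fft=512, hop=128):
    """Magnitude STFT with a Hann window; shape (n_fft // 2 + 1, n_frames).

    n_fft and hop are illustrative defaults, not values from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T


def split_bands(spec, n_bands=3):
    """Split along the frequency axis into contiguous bands.

    Equal-width bands are an assumption; the paper's band edges are unknown.
    """
    return np.array_split(spec, n_bands, axis=0)


def spectral_flux(spec):
    """L2 norm of the frame-to-frame magnitude difference, one value per step."""
    return np.sqrt((np.diff(spec, axis=1) ** 2).sum(axis=0))


def high_frequency_energy_ratio(spec, sample_rate, cutoff_hz=4000.0):
    """Fraction of total energy at or above cutoff_hz (cutoff is hypothetical)."""
    freqs = np.linspace(0.0, sample_rate / 2.0, spec.shape[0])
    power = spec ** 2
    return power[freqs >= cutoff_hz].sum() / (power.sum() + 1e-12)


# Usage on a synthetic 1-second, 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 440.0 * t)

spec = magnitude_spectrogram(tone)
bands = split_bands(spec)  # low, mid, and high band spectrograms
flux = spectral_flux(spec)
hf_ratio = high_frequency_energy_ratio(spec, sample_rate)
```

In a pipeline of the kind the abstract describes, each band would be passed to the pretrained encoder, and descriptors such as `flux` and `hf_ratio` would be fused with the encoder output; how that fusion is done is not recoverable from this record.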
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


