Advances in audio synthesis techniques have led to the creation of highly realistic audio deepfakes, posing growing threats to digital integrity and public trust. These synthetic manipulations mimic natural speech with high fidelity, making detection increasingly challenging and fueling the spread of misinformation, identity fraud, and voice-based attacks. To address these concerns, this study proposes the Adaptive Spectro-Temporal Diffusion Transformer (ASTDT), a novel detection framework that tackles key challenges in generalization, interpretability, and adaptability across diverse audio generation techniques. ASTDT integrates a score-based diffusion model to augment training spectrograms with realistic deepfake variations, improving generalization to unseen text-to-speech and voice conversion attacks. An adaptive spectro-temporal feature extraction mechanism partitions audio into interpretable frequency and temporal segments, while a dual-modal attention fusion module jointly processes magnitude and phase features. These fused features are processed by a transformer encoder with diffusion-aware attention, enabling effective modeling of long-range temporal dependencies. To enhance transparency, ASTDT includes an interpretability module that combines quantitative feature attributions and spatial heatmaps to explain model predictions. Experimental results across four benchmark datasets demonstrate the effectiveness of ASTDT, with the model achieving the lowest equal error rate of 1.20% on the ASVspoof 2019 dataset.
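As context for the headline result above: the equal error rate (EER) is the operating point at which the false-rejection rate (genuine audio rejected) equals the false-acceptance rate (spoofed audio accepted); lower is better. The following is a minimal illustrative sketch of how EER is computed from detector scores — the score distributions and function name here are hypothetical, not taken from the paper:

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """Sweep thresholds and return the error rate at the point where the
    false-rejection rate and false-acceptance rate are closest to equal."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        frr = np.mean(genuine_scores < t)   # genuine wrongly rejected
        far = np.mean(spoof_scores >= t)    # spoof wrongly accepted
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer

# Toy, well-separated score distributions (illustrative values only):
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)   # higher score = more likely genuine
spoof = rng.normal(-2.0, 1.0, 1000)
print(f"EER: {equal_error_rate(genuine, spoof):.3f}")
```

With well-separated distributions like these the EER is small; the 1.20% figure reported for ASTDT on ASVspoof 2019 corresponds to an EER of 0.012 on that benchmark's scoring protocol.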
ASTDT: an Interpretable Adaptive Spectro-Temporal Diffusion Transformer for audio deepfake detection / Wani, Taiba Majid; Qadri, Syed Asif Ahmad; Ashraf, Arselan; Amerini, Irene. - In: EURASIP JOURNAL ON INFORMATION SECURITY. - ISSN 2510-523X. - 2025:1(2025). [10.1186/s13635-025-00217-3]
ASTDT: an Interpretable Adaptive Spectro-Temporal Diffusion Transformer for audio deepfake detection
Wani, Taiba Majid; Amerini, Irene
2025
| File | Size | Format |
|---|---|---|
| Wani_ASTDT_2025.pdf (open access; publisher's version, published with the publisher's layout; Creative Commons license; note: https://jis-eurasipjournals.springeropen.com/counter/pdf/10.1186/s13635-025-00217-3) | 5.36 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


