Dynamic knowledge condensation with audio-selective transformer for audio deepfake detection / Wani, Taiba Majid; Amerini, Irene. - In: DISCOVER COMPUTING. - ISSN 2948-2992. - 28:1(2025). [10.1007/s10791-025-09746-4]

Dynamic knowledge condensation with audio-selective transformer for audio deepfake detection

Wani, Taiba Majid; Amerini, Irene
2025

Abstract

The rapid evolution of audio deepfakes has raised significant challenges for the security and reliability of voice-driven systems. While recent detection frameworks achieve high accuracy in controlled environments, their performance often degrades under real-world conditions involving codec compression, signal preprocessing, or domain shifts. To address these challenges, we propose dynamic knowledge condensation with audio-selective transformer (DK-CAST), a novel tri-stream knowledge distillation framework designed for robust audio deepfake detection. DK-CAST employs a high-capacity XLS-R teacher trained on clean speech to supervise a compact student model operating on degraded and preprocessed audio. The student uses a custom audio-selective transformer with dual-stream encoding, dynamic fusion, and phoneme-gated attention to emphasize linguistically relevant cues. Knowledge is transferred via multi-level supervision, including logits, embeddings, and phoneme posteriors, and modulated through a codec-aware loss weighting scheme. To enhance generalization, DK-CAST also includes a compression-agnostic embedding alignment module based on MMD and Center Loss. Evaluations on ASVspoof 2019-LA and ASVspoof 2021-DF demonstrate state-of-the-art performance, achieving EERs of 0.38% and 2.18%, respectively. Furthermore, DK-CAST maintains strong performance under codec degradation, achieving an EER of 3.01% on ASVspoof 2021-DF when tested under MP3 compression.
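The abstract describes a distillation objective that combines logit-level supervision with an MMD-based embedding alignment term, scaled by a codec-aware weight. The sketch below illustrates that general loss structure in NumPy. It is not the authors' implementation: the function and parameter names (`dk_cast_style_loss`, `codec_weight`, `lam_emb`), the RBF kernel bandwidth, and the omission of the Center Loss and phoneme-posterior terms are all simplifying assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # Mean KL(p || q) over a batch of categorical distributions
    # (logit-level teacher-to-student supervision).
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def rbf_mmd2(x, y, sigma=1.0):
    # Biased squared Maximum Mean Discrepancy between two sample sets
    # under an RBF kernel (embedding alignment term).
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return float(k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean())

def dk_cast_style_loss(teacher_logits, student_logits,
                       teacher_emb, student_emb,
                       codec_weight=1.0, lam_emb=0.5):
    # Hypothetical combined objective: KL on logits plus MMD on embeddings,
    # scaled by a per-sample codec weight (harder codecs -> larger weight
    # in a codec-aware weighting scheme of the kind the paper describes).
    l_logits = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    l_emb = rbf_mmd2(teacher_emb, student_emb)
    return codec_weight * (l_logits + lam_emb * l_emb)
```

With identical teacher and student outputs the loss is zero, and it grows as the student's embeddings drift from the teacher's, which is the behavior the alignment module is meant to penalize.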
Audio compression; Audio deepfake detection; Codec-aware learning; Knowledge distillation; Phoneme-gated attention
01 Journal publication::01a Journal article
Files attached to this product
File: Wani_Dynamic-knowledge_2025.pdf

Open access

Note: https://link.springer.com/content/pdf/10.1007/s10791-025-09746-4.pdf
Type: Publisher's version (published with the publisher's layout)
License: Creative Commons
Size: 2.85 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1765784
Citations
  • PMC: ND
  • Scopus: 0
  • Web of Science: 0