Dynamic knowledge condensation with audio-selective transformer for audio deepfake detection / Wani, Taiba Majid; Amerini, Irene. - In: DISCOVER COMPUTING. - ISSN 2948-2992. - 28:1(2025). [10.1007/s10791-025-09746-4]
Dynamic knowledge condensation with audio-selective transformer for audio deepfake detection
Wani, Taiba Majid; Amerini, Irene
2025
Abstract
The rapid evolution of audio deepfakes has raised significant challenges for the security and reliability of voice-driven systems. While recent detection frameworks achieve high accuracy in controlled environments, their performance often degrades under real-world conditions involving codec compression, signal preprocessing, or domain shifts. To address these challenges, we propose dynamic knowledge condensation with audio-selective transformer (DK-CAST), a novel tri-stream knowledge distillation framework designed for robust audio deepfake detection. DK-CAST employs a high-capacity XLS-R teacher trained on clean speech to supervise a compact student model operating on degraded and preprocessed audio. The student employs a custom audio-selective transformer with dual-stream encoding, dynamic fusion, and phoneme-gated attention to emphasize linguistically relevant cues. Knowledge is transferred via multi-level supervision, including logits, embeddings, and phoneme posteriors, and modulated through a codec-aware loss weighting scheme. To enhance generalization, DK-CAST also includes a compression-agnostic embedding alignment module based on MMD and Center Loss. Evaluations on ASVspoof 2019-LA and ASVspoof 2021-DF demonstrate state-of-the-art performance, achieving EERs of 0.38% and 2.18%, respectively. Furthermore, DK-CAST maintains strong performance under codec degradation, achieving an EER of 3.01% on ASVspoof 2021-DF when tested under MP3 compression.
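The multi-level supervision described in the abstract can be sketched as a weighted sum of a logit-level distillation term, an embedding-alignment term, and a phoneme-posterior matching term, scaled by a codec-aware weight. This is a minimal illustrative sketch only: the function names, the loss weights, the use of a linear-kernel MMD, and the KL-based phoneme term are assumptions, not the paper's exact formulation, and the Center Loss component is omitted for brevity.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-softened softmax over the last axis.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q):
    # Mean KL(p || q) over a batch of probability rows.
    eps = 1e-12
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def mmd_linear(x, y):
    # Linear-kernel MMD between two batches of embeddings:
    # squared distance between batch means.
    d = x.mean(axis=0) - y.mean(axis=0)
    return float(d @ d)

def dk_cast_loss(student_logits, teacher_logits,
                 student_emb, teacher_emb,
                 student_phone, teacher_phone,
                 codec_weight=1.0, T=2.0,
                 alpha=1.0, beta=0.5, gamma=0.5):
    # Logit-level distillation: KL between temperature-softened
    # teacher and student distributions (scaled by T^2, as in standard KD).
    kd = kl_div(softmax(teacher_logits, T), softmax(student_logits, T)) * T * T
    # Embedding-level alignment via linear-kernel MMD.
    emb = mmd_linear(student_emb, teacher_emb)
    # Phoneme-posterior matching between teacher and student.
    ph = kl_div(softmax(teacher_phone), softmax(student_phone))
    # Codec-aware weight modulates the overall supervision strength
    # depending on the degradation applied to the student's input.
    return codec_weight * (alpha * kd + beta * emb + gamma * ph)
```

All three terms are non-negative, so the combined loss is zero exactly when the student matches the teacher at every supervision level, and the `codec_weight` factor lets harder (more compressed) inputs contribute more or less to the gradient.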
| File | Size | Format |
|---|---|---|
| Wani_Dynamic-knowledge_2025.pdf (open access; publisher's version, with the publisher's layout; Creative Commons license) | 2.85 MB | Adobe PDF |

Note: https://link.springer.com/content/pdf/10.1007/s10791-025-09746-4.pdf
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


