Dynamic knowledge condensation with audio-selective transformer for audio deepfake detection / Wani, Taiba Majid; Amerini, Irene. - In: DISCOVER COMPUTING. - ISSN 2948-2992. - 28:1(2025). [10.1007/s10791-025-09746-4]
Dynamic knowledge condensation with audio-selective transformer for audio deepfake detection
Wani, Taiba Majid; Amerini, Irene
2025
Abstract
The rapid evolution of audio deepfakes has raised significant challenges for the security and reliability of voice-driven systems. While recent detection frameworks achieve high accuracy in controlled environments, their performance often degrades under real-world conditions involving codec compression, signal preprocessing, or domain shifts. To address these challenges, we propose dynamic knowledge condensation with audio-selective transformer (DK-CAST), a novel tri-stream knowledge distillation framework designed for robust audio deepfake detection. DK-CAST employs a high-capacity XLS-R teacher trained on clean speech to supervise a compact student model operating on degraded and preprocessed audio. The student employs a custom audio-selective transformer with dual-stream encoding, dynamic fusion, and phoneme-gated attention to emphasize linguistically relevant cues. Knowledge is transferred via multi-level supervision, including logits, embeddings, and phoneme posteriors, and modulated through a codec-aware loss weighting scheme. To enhance generalization, DK-CAST also includes a compression-agnostic embedding alignment module based on MMD and Center Loss. Evaluations on ASVspoof 2019-LA and ASVspoof 2021-DF demonstrate state-of-the-art performance, achieving EERs of 0.38% and 2.18%, respectively. Furthermore, DK-CAST maintains strong performance under codec degradation, achieving an EER of 3.01% on ASVspoof 2021-DF when tested under MP3 compression.
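The multi-level supervision described in the abstract can be sketched as a weighted sum of a logit-level distillation term, an embedding-alignment term, and a phoneme-posterior matching term, scaled by a codec-aware weight. This is a minimal illustrative sketch only: the function names, the loss weights, the use of a linear-kernel MMD, and the KL-based phoneme term are assumptions, not the paper's exact formulation, and the Center Loss component is omitted for brevity.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-softened softmax over the last axis.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q):
    # Mean KL(p || q) over a batch of probability rows.
    eps = 1e-12
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def mmd_linear(x, y):
    # Linear-kernel MMD between two batches of embeddings:
    # squared distance between batch means.
    d = x.mean(axis=0) - y.mean(axis=0)
    return float(d @ d)

def dk_cast_loss(student_logits, teacher_logits,
                 student_emb, teacher_emb,
                 student_phone, teacher_phone,
                 codec_weight=1.0, T=2.0,
                 alpha=1.0, beta=0.5, gamma=0.5):
    # Logit-level distillation: KL between temperature-softened
    # teacher and student distributions (scaled by T^2, as in standard KD).
    kd = kl_div(softmax(teacher_logits, T), softmax(student_logits, T)) * T * T
    # Embedding-level alignment via linear-kernel MMD.
    emb = mmd_linear(student_emb, teacher_emb)
    # Phoneme-posterior matching between teacher and student.
    ph = kl_div(softmax(teacher_phone), softmax(student_phone))
    # Codec-aware weight modulates the overall supervision strength
    # depending on the degradation applied to the student's input.
    return codec_weight * (alpha * kd + beta * emb + gamma * ph)
```

All three terms are non-negative, so the combined loss is zero exactly when the student matches the teacher at every supervision level, and the `codec_weight` factor lets harder (more compressed) inputs contribute more or less to the gradient.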
| File | Size | Format |
|---|---|---|
| Wani_Dynamic-knowledge_2025.pdf (open access; publisher's version, with the publisher's layout; Creative Commons license) | 2.85 MB | Adobe PDF |

Note: https://link.springer.com/content/pdf/10.1007/s10791-025-09746-4.pdf
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


