The increasing prevalence of audio deepfakes has raised serious concerns due to their potential misuse in identity theft, disinformation, and the compromise of voice authentication systems. Detecting these manipulations requires models capable of handling a wide range of audio features and attack strategies. In this paper, we introduce HCN-TA (Hierarchical Capsule Network with Temporal Attention), a novel architecture specifically designed for scalable and generalizable audio deepfake detection. The hierarchical capsule networks capture local and global audio patterns, while the multi-resolution temporal attention focuses on key segments likely to contain deepfake artifacts. Temporal locality awareness prioritizes critical, rapidly changing regions. We validate the effectiveness of HCN-TA on the ASVspoof 2019 (LA) and FoR datasets, achieving low equal error rates (EER) of 0.42% and 0.11%, respectively.
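The equal error rate quoted above is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The following is a minimal sketch of how such a figure can be computed from detector scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that higher scores mean "more likely bona fide", and the label encoding (1 = bona fide, 0 = spoof) are assumptions made for illustration, and official challenge tooling may handle thresholds differently.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by scanning every observed score as a threshold.

    Assumed conventions (illustrative, not taken from the paper):
      - a higher score means the sample is more likely bona fide
      - labels: 1 = bona fide, 0 = spoof
    Returns (eer, threshold) with eer as a fraction in [0, 1].
    """
    bona = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    if not bona or not spoof:
        raise ValueError("need at least one bona fide and one spoof score")

    best = None  # (gap between FAR and FRR, eer estimate, threshold)
    for t in sorted(set(scores)):
        # A sample is accepted as bona fide when its score is >= t.
        far = sum(s >= t for s in spoof) / len(spoof)  # spoof accepted
        frr = sum(s < t for s in bona) / len(bona)     # bona fide rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy example: one spoof scores above one bona fide sample.
eer, threshold = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(f"EER = {100 * eer:.2f}% at threshold {threshold}")  # EER = 50.00% at threshold 0.6
```

With only a handful of scores the two error rates may never cross exactly, so the sketch reports the midpoint at the closest threshold. On large evaluation sets the gap becomes negligible.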

HCN-TA: Hierarchical Capsule Network with Temporal Attention for a Generalizable Approach to Audio Deepfake Detection / Wani, T. M.; Uecker, M.; Wani, F. A.; Amerini, I.. - (2025), pp. 775-777. ( 40th Annual ACM Symposium on Applied Computing, SAC 2025 Sicily, Italy ) [10.1145/3672608.3707761].

HCN-TA: Hierarchical Capsule Network with Temporal Attention for a Generalizable Approach to Audio Deepfake Detection

Wani T. M. (first author); Wani F. A.; Amerini I.
2025

Conference: 40th Annual ACM Symposium on Applied Computing, SAC 2025
Keywords: audio deepfake; capsule network; temporal attention; generalization; fake detection
Type: 04 Conference proceedings publication::04b Conference paper in volume
Files attached to this record
File: Wani_HCN-TA_2025.pdf
Access: archive administrators only
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 1.22 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1741570