Detecting audio deepfakes: integrating CNN and BiLSTM with multi-feature concatenation / Wani, Taiba Majid; Qadri, Syed Asif Ahmad; Comminiello, Danilo; Amerini, Irene. - (2024), pp. 271-276. (Paper presented at IH&MMSec '24: Proceedings of the 2024 ACM Workshop on Information Hiding and Multimedia Security, held in Baiona, Spain) [10.1145/3658664.3659647].
Detecting audio deepfakes: integrating CNN and BiLSTM with multi-feature concatenation
| Author | Position | Contribution |
|---|---|---|
| Wani, Taiba Majid | First | Writing – Original Draft Preparation |
| Comminiello, Danilo; Amerini, Irene | Second | Supervision |
2024
Abstract
Audio deepfake detection is emerging as a crucial field in digital media, as distinguishing real audio from deepfakes becomes increasingly challenging with the advancement of deepfake technologies. These methods threaten information authenticity and pose serious security risks. To address this challenge, we propose a novel architecture that combines Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) networks for effective deepfake audio detection. Our approach is distinguished by the concatenation of a comprehensive set of acoustic features: Mel Frequency Cepstral Coefficients (MFCC), Mel spectrograms, Constant Q Cepstral Coefficients (CQCC), and Constant-Q Transform (CQT) vectors. In the proposed architecture, features processed by a CNN are concatenated into two multi-dimensional representations, which are then analyzed by a BiLSTM network to capture temporal dynamics and contextual dependencies in the audio. This synergistic method captures both spatial and sequential audio characteristics. We validate our model on the ASVspoof 2019 and FoR datasets, using accuracy and Equal Error Rate (EER) as evaluation metrics.
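The abstract describes the pipeline only at a high level; the implementation details live in the paper itself. The following Python sketch (librosa, PyTorch, scipy, and scikit-learn, all assumed dependencies) is a rough illustration of a pipeline of this kind, not the authors' code: it extracts the four named features, stacks them into one tensor, and runs a small CNN-BiLSTM classifier, with a standard EER computation for the reported metric. Every hyperparameter (40 coefficients, filter counts, hidden size) is an assumption rather than the paper's value, librosa ships no CQCC extractor so CQCC is approximated from the CQT, and the single four-channel stack simplifies the paper's two separate multi-dimensional feature groups.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn
from scipy.fft import dct


def extract_features(path, sr=16000, n_coeff=40):
    """Extract the four acoustic features named in the abstract.
    All dimensions here are illustrative, not the paper's settings."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeff)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_coeff))
    cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=n_coeff))
    # librosa has no CQCC extractor; this approximates CQCC as a DCT of
    # the log-CQT (the full pipeline also resamples to a uniform scale).
    cqcc = dct(np.log(cqt + 1e-8), axis=0, norm="ortho")
    return mfcc, mel, cqt, cqcc


def stack_features(path):
    """Stack the four feature maps as channels of a single tensor."""
    feats = extract_features(path)
    t = min(f.shape[1] for f in feats)          # align frame counts
    return np.stack([f[:, :t] for f in feats])  # shape (4, n_coeff, t)


class CNNBiLSTM(nn.Module):
    """CNN front end for local spectro-temporal patterns, followed by a
    BiLSTM that models temporal context in both directions."""

    def __init__(self, n_feats=4, n_coeff=40, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(n_feats, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(64 * (n_coeff // 4), hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 2)      # bona fide vs. spoof

    def forward(self, x):                       # x: (B, 4, n_coeff, T)
        z = self.cnn(x)                         # (B, 64, n_coeff/4, T/4)
        z = z.permute(0, 3, 1, 2).flatten(2)    # (B, T/4, 64 * n_coeff/4)
        out, _ = self.lstm(z)
        return self.fc(out[:, -1])              # logits from last time step


def eer(labels, scores):
    """Equal Error Rate: the operating point where the false-acceptance
    and false-rejection rates coincide (standard anti-spoofing metric)."""
    from scipy.interpolate import interp1d
    from scipy.optimize import brentq
    from sklearn.metrics import roc_curve
    fpr, tpr, _ = roc_curve(labels, scores)
    return brentq(lambda x: 1.0 - x - interp1d(fpr, tpr)(x), 0.0, 1.0)
```

Stacking the features as channels of one tensor is one plausible reading of "feature concatenation"; the paper's exact grouping into two concatenated multi-dimensional representations may differ.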
| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| Majid-Wani_Detecting_2024.pdf | Open access | Post-print document (version after peer review, accepted for publication) | Creative Commons | 1.34 MB | Adobe PDF |
| Majid-Wani_Indice_Detecting_2024.pdf (note: table of contents) | Archive managers only (contact the author) | Other attached material | All rights reserved | 2.71 MB | Adobe PDF |
| Majid-Wani_Frontespizio_Dectecting_2024.pdf (note: title page) | Archive managers only (contact the author) | Other attached material | All rights reserved | 1.87 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.