
CMDD: A novel multimodal two-stream CNN deepfakes detector / Mongelli, L.; Maiano, L.; Amerini, I. - 3677:(2024), pp. 17-30. (Paper presented at the 4th Workshop on Reducing Online Misinformation through Credible Information Retrieval, ROMCIR 2024, held in Glasgow, UK).

CMDD: A novel multimodal two-stream CNN deepfakes detector

Mongelli L.; Maiano L.; Amerini I.
2024

Abstract

Researchers commonly model deepfake detection as a binary classification problem, using a separate unimodal network for each manipulated modality (such as auditory and visual) and a final ensemble of their predictions. In this paper, we focus instead on jointly detecting relationships between audio and visual cues, extracting more comprehensive information to expose deepfakes. We propose the Convolutional Multimodal deepfake detection model (CMDD), a novel multimodal model that relies on two Convolutional Neural Networks (CNNs) to concurrently extract and process spatial and temporal features. We compare it with two baseline models: DeepFakeCVT, which uses two CNNs followed by a Vision Transformer, and DeepMerge, which employs a score fusion of the unimodal CNN models. The multimodal FakeAVCeleb dataset was used to train and test our model, yielding an accuracy of 98.9% and placing our model among the top three models evaluated on FakeAVCeleb.
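The two-stream design summarized above can be sketched as a pair of CNN branches, one per modality, whose features are fused before a joint classifier. This is a minimal illustrative sketch in PyTorch: all layer sizes, input shapes, and names here are hypothetical assumptions, not the actual CMDD configuration from the paper.

```python
import torch
import torch.nn as nn

class TwoStreamDetector(nn.Module):
    """Hypothetical sketch of a two-stream multimodal deepfake detector.

    One CNN branch processes RGB video frames, the other audio
    spectrograms; their features are concatenated and classified
    jointly (real vs. fake). Layer sizes are illustrative only.
    """

    def __init__(self):
        super().__init__()
        # Visual branch: small CNN over 3-channel frames.
        self.visual = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Audio branch: small CNN over 1-channel spectrograms.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Joint head over the concatenated modality features.
        self.head = nn.Linear(16 + 16, 2)

    def forward(self, frames, spectrogram):
        fused = torch.cat([self.visual(frames), self.audio(spectrogram)], dim=1)
        return self.head(fused)

model = TwoStreamDetector()
# Batch of 4 hypothetical 64x64 frames and 64x64 spectrograms.
logits = model(torch.randn(4, 3, 64, 64), torch.randn(4, 1, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```

Fusing features inside one network, rather than ensembling per-modality scores as in the DeepMerge baseline, lets the classifier see cross-modal relationships directly.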
4th Workshop on Reducing Online Misinformation through Credible Information Retrieval, ROMCIR 2024
Deepfake detection; Misinformation; Multimedia Forensics; Multimodal deepfake
04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this record

File: Maiano_CMDD_2024.pdf (open access)
Note: https://ceur-ws.org/Vol-3677/paper2.pdf
Type: Publisher's version (published with the publisher's layout)
License: Creative Commons
Size: 547.09 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this record: https://hdl.handle.net/11573/1710643
Citations
  • Scopus: 0