CMDD: A novel multimodal two-stream CNN deepfakes detector / Mongelli, L.; Maiano, L.; Amerini, I. - 3677 (2024), pp. 17-30. (Paper presented at the 4th Workshop on Reducing Online Misinformation through Credible Information Retrieval, ROMCIR 2024, held in Glasgow, UK.)
CMDD: A novel multimodal two-stream CNN deepfakes detector
Mongelli L.; Maiano L.; Amerini I.
2024
Abstract
Researchers commonly model deepfake detection as a binary classification problem, using a unimodal network for each manipulated modality (such as auditory and visual) and a final ensemble of their predictions. In this paper, we focus instead on jointly detecting relationships between audio and visual cues, extracting more comprehensive information with which to expose deepfakes. We propose the Convolutional Multimodal deepfake detection model (CMDD), a novel multimodal model that relies on two Convolutional Neural Networks (CNNs) to concurrently extract and process spatial and temporal features. We compare it with two baseline models: DeepFakeCVT, which uses two CNNs and a final Vision Transformer, and DeepMerge, which employs score fusion of the individual unimodal CNN models. We trained and tested our model on the multimodal FakeAVCeleb dataset, reaching an accuracy of 98.9% and placing it among the top three models evaluated on FakeAVCeleb.
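As a rough illustration of the two-stream idea described in the abstract, the sketch below shows a generic PyTorch model with one CNN per modality and feature-level fusion before a binary classifier. All module names, layer sizes, input shapes, and the concatenation-based fusion here are assumptions for illustration only; the paper linked below describes the actual CMDD architecture.

```python
# Minimal sketch of a generic two-stream CNN for audio-visual
# deepfake classification. Layer sizes, input shapes, and the
# concatenation-based fusion are illustrative assumptions, NOT
# the actual CMDD architecture from the paper.
import torch
import torch.nn as nn


class StreamCNN(nn.Module):
    """A small CNN mapping one modality (e.g. face crops or audio
    spectrograms) to a fixed-size feature vector."""

    def __init__(self, in_channels: int, feat_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> (B, 64, 1, 1)
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.features(x).flatten(1))


class TwoStreamDetector(nn.Module):
    """Two CNN streams (visual + audio) fused by concatenation,
    followed by a binary real/fake classifier head."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.visual = StreamCNN(in_channels=3, feat_dim=feat_dim)  # RGB frames
        self.audio = StreamCNN(in_channels=1, feat_dim=feat_dim)   # spectrograms
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # single logit: fake vs. real
        )

    def forward(self, frames: torch.Tensor, spec: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.visual(frames), self.audio(spec)], dim=1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = TwoStreamDetector()
    frames = torch.randn(4, 3, 112, 112)  # batch of face crops
    spec = torch.randn(4, 1, 64, 128)     # batch of audio spectrograms
    logits = model(frames, spec)          # shape (4, 1)
    print(logits.shape)
```

Concatenating features before the classifier, as above, is one simple alternative to the score-level fusion used by the DeepMerge baseline; the actual fusion strategy in CMDD may differ.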
File: Maiano_CMDD_2024.pdf
Access: Open access
Note: https://ceur-ws.org/Vol-3677/paper2.pdf
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 547.09 kB
Format: Adobe PDF
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.