Deepfakes aim to deceive viewers into believing that synthesized media, in which one person's characteristics are mapped onto another actor, is genuine. Unfortunately, malicious uses of deepfakes are likely to dominate legitimate ones. In this report, we present a multimodal deep learning solution that detects deepfakes using audio-visual modalities and several options for fusing them. We postulate that deepfake generation will occasionally disrupt the audio-visual portrayal of emotion. Guided by reasoning grounded in psychology and affective computing, we select features that carry rich sentiment information and train a CNN Bi-LSTM based classifier with an early additive fusion of these modalities. We perform experiments on the DFDC and DeepFake-TIMIT datasets, investigate the correlation between modalities, and identify which features contribute most to classification. We show that our approach outperforms the state of the art by up to 5.6%.
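To make the fusion strategy concrete, below is a minimal NumPy sketch of the early additive fusion step described above: both modality streams are projected into a shared feature space and summed element-wise before any temporal modeling. The feature dimensions, the random projections, and the comments naming the feature types are illustrative assumptions, not details from the paper; the downstream CNN Bi-LSTM classifier is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame feature sizes (illustrative, not from the paper).
T, D_AUDIO, D_VISUAL, D_FUSED = 50, 40, 128, 64

audio = rng.standard_normal((T, D_AUDIO))    # e.g. prosodic/sentiment audio features
visual = rng.standard_normal((T, D_VISUAL))  # e.g. facial-expression features

# Learnable projections (random stand-ins here) map both modalities
# into a common embedding space of size D_FUSED.
W_a = rng.standard_normal((D_AUDIO, D_FUSED)) * 0.1
W_v = rng.standard_normal((D_VISUAL, D_FUSED)) * 0.1

# Early additive fusion: element-wise sum of the projected streams,
# performed before the sequence reaches the temporal model.
fused = audio @ W_a + visual @ W_v           # shape (T, D_FUSED)

# In the full pipeline, `fused` would feed the CNN Bi-LSTM classifier.
print(fused.shape)
```

Early fusion of this additive form lets the temporal model observe cross-modal interactions (such as mismatches between spoken emotion and facial expression) from the first layer, in contrast to late fusion, which combines per-modality predictions only at the end.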
Effective start/end date: 3/05/21 → 27/08/21