Detecting and Grounding Multi-Modal Media Manipulation

by Rui Shao, et al.

Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various deepfake detection and text fake news detection methods have been proposed, they are designed only for single-modality forgery and treat detection as binary classification; they do not analyze or reason about subtle forgery traces across different modalities. In this paper, we highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM^4). DGM^4 aims not only to detect the authenticity of multi-modal media, but also to ground the manipulated content (i.e., image bounding boxes and text tokens), which requires deeper reasoning about multi-modal media manipulation. To support a large-scale investigation, we construct the first DGM^4 dataset, where image-text pairs are manipulated by various approaches, with rich annotations of diverse manipulations. Moreover, we propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities. HAMMER performs 1) manipulation-aware contrastive learning between two uni-modal encoders as shallow manipulation reasoning, and 2) modality-aware cross-attention by a multi-modal aggregator as deep manipulation reasoning. Dedicated manipulation detection and grounding heads are integrated from shallow to deep levels based on the interacted multi-modal information. Finally, we build an extensive benchmark and set up rigorous evaluation metrics for this new research problem. Comprehensive experiments demonstrate the superiority of our model; several valuable observations are also revealed to facilitate future research in multi-modal media manipulation.
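To make the idea of manipulation-aware contrastive learning between two uni-modal encoders more concrete, here is a minimal sketch of one plausible reading. It is not the authors' implementation: the abstract does not give the loss. The sketch assumes an InfoNCE-style image-to-text loss in which only original (unmanipulated) image-text pairs serve as positives, while manipulated pairs contribute only as negatives. The function name `manipulation_aware_info_nce`, the `is_original` flags, and the temperature value are all hypothetical.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def manipulation_aware_info_nce(image_embs, text_embs, is_original, temperature=0.07):
    """Image-to-text InfoNCE loss over a batch of paired embeddings.

    Assumption (not from the paper): pair i is a positive only when
    is_original[i] is True. Manipulated pairs are skipped as anchors,
    but their text embeddings still appear among the negatives.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    losses = []
    for i, img in enumerate(imgs):
        if not is_original[i]:
            continue  # manipulated pair: no positive to pull together
        logits = [dot(img, t) / temperature for t in txts]
        # Numerically stable log-sum-exp over all texts in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
    flags = [True, True, False]  # third pair is treated as manipulated
    print(manipulation_aware_info_nce(images, texts, flags))
```

In this sketch the loss is low when each original image embedding is closest to its own text embedding, and it grows when the pairing is broken. A symmetric text-to-image term and the cross-attention stage described in the abstract are left out.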




Is Multi-Modal Necessarily Better? Robustness Evaluation of Multi-modal Fake News Detection

The proliferation of fake news and its serious negative social influence...

Detecting and Recovering Sequential DeepFake Manipulation

Since photorealistic faces can be readily generated by facial manipulati...

Memotion Analysis through the Lens of Joint Embedding

Joint embedding (JE) is a way to encode multi-modal data into a vector s...

Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement

Sarcasm is a linguistic phenomenon indicating a discrepancy between lite...

REX: Reasoning-aware and Grounded Explanation

Effectiveness and interpretability are two essential properties for trus...

Focusing on Relevant Responses for Multi-modal Rumor Detection

In the absence of an authoritative statement about a rumor, people may e...

COVID-VTS: Fact Extraction and Verification on Short Video Platforms

We introduce a new benchmark, COVID-VTS, for fact-checking multi-modal i...
