Multimodal Event Graphs: Towards Event-Centric Understanding of Multimodal World

06/14/2022
by Hammad A. Ayyubi, et al.

Understanding how events described or shown in multimedia content relate to one another is a critical component of developing robust artificially intelligent systems that can reason about real-world media. While much research has been devoted to event understanding in the text, image, and video domains, none has explored the complex relations that events hold across modalities. For example, a news article may describe a `protest' event while a video shows an `arrest' event. Recognizing that the visual `arrest' event is a subevent of the broader `protest' event is a challenging yet important problem that prior work has not explored. In this paper, we propose the novel task of MultiModal Event-Event Relations to recognize such cross-modal event relations. We contribute a large-scale dataset consisting of 100k video-news article pairs, as well as a benchmark of densely annotated data. We also propose a weakly supervised multimodal method that integrates commonsense knowledge from an external knowledge base (KB) to predict rich multimodal event hierarchies. Experiments show that our model outperforms a number of competitive baselines on the proposed benchmark. We also perform a detailed analysis of our model's performance and suggest directions for future research.


