MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding

by Revanth Gangi Reddy, et al.

Recently, there has been increasing interest in building question answering (QA) models that reason across multiple modalities, such as text and images. However, QA using images is often limited to picking the answer from a pre-defined set of options. In addition, images in the real world, especially in news, contain objects that are co-referential with the text, and the two modalities carry complementary information. In this paper, we present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text. Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to, and then predicting a span from the news body text to answer the question. In addition, we introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically generate data that can provide weak supervision for this task. We evaluate both pipeline-based and end-to-end pretraining-based multimedia QA models on our benchmark and show that they achieve promising performance but lag considerably behind human performance, leaving substantial room for future work on this challenging new task.




VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning

The ideal form of Visual Question Answering requires understanding, grou...

Open Question Answering over Tables and Text

In open question answering (QA), the answer to a question is produced by...

Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

A multi-hop question answering (QA) dataset aims to test reasoning and i...

Calibrating Trust of Multi-Hop Question Answering Systems with Decompositional Probes

Multi-hop Question Answering (QA) is a challenging task since it require...

Watching the News: Towards VideoQA Models that can Read

Video Question Answering methods focus on commonsense reasoning and visu...

Focal Visual-Text Attention for Memex Question Answering

Recent insights on language and vision with neural networks have been su...

DialogQAE: N-to-N Question Answer Pair Extraction from Customer Service Chatlog

Harvesting question-answer (QA) pairs from customer service chatlog in t...
