Adventurer's Treasure Hunt: A Transparent System for Visually Grounded Compositional Visual Question Answering based on Scene Graphs

by Daniel Reich, et al.

With the express goal of improving system transparency and visual grounding in the VQA reasoning process, we present a modular system for compositional VQA based on scene graphs. Our system is called "Adventurer's Treasure Hunt" (or ATH), named after an analogy we draw between our model's search for an answer and an adventurer's search for treasure. We developed ATH with three characteristic features in mind:

1. By design, ATH lets us explicitly quantify both the impact of each sub-component on overall VQA performance and its performance on its individual sub-task.

2. By modeling the search task after a treasure hunt, ATH inherently produces an explicit, visually grounded inference path for each processed question.

3. ATH is the first GQA-trained VQA system that dynamically extracts answers by querying the visual knowledge base directly, rather than selecting one from a learned classifier's output distribution over a pre-fixed answer vocabulary.

We report detailed results on all components and their contributions to overall VQA performance on the GQA dataset, and show that ATH achieves the highest visual grounding score among all examined systems.
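To make the third point concrete, the following is a minimal, hypothetical sketch of what "extracting answers by querying the visual knowledge base directly" can look like on a GQA-style scene graph. The graph format, the operator names (`select`, `relate`, `query_attribute`), and the toy question are illustrative assumptions, not the paper's actual implementation; the point is that the answer string comes out of the graph itself rather than a fixed answer vocabulary.

```python
# Toy GQA-style scene graph: objects with names, attributes, and relations.
# (Structure is assumed for illustration.)
scene_graph = {
    "o1": {"name": "table", "attributes": ["wooden"], "relations": []},
    "o2": {"name": "cup", "attributes": ["red"], "relations": [("on", "o1")]},
    "o3": {"name": "book", "attributes": ["blue"], "relations": [("next to", "o2")]},
}

def select(graph, name):
    """Return the ids of all objects matching a given name."""
    return [oid for oid, obj in graph.items() if obj["name"] == name]

def relate(graph, targets, relation):
    """Return the ids of objects that hold `relation` toward any target id."""
    return [oid for oid, obj in graph.items()
            if any(rel == relation and dst in targets
                   for rel, dst in obj["relations"])]

def query_attribute(graph, oid):
    """Read the answer off the graph node itself -- the 'treasure'.

    The answer is whatever string the knowledge base contains, so it is
    not restricted to a pre-fixed classifier vocabulary.
    """
    return graph[oid]["attributes"][0]

# "What color is the thing on the table?"
tables = select(scene_graph, "table")          # -> ["o1"]
on_table = relate(scene_graph, tables, "on")   # -> ["o2"]
answer = query_attribute(scene_graph, on_table[0])
print(answer)  # -> red
```

The intermediate object-id lists (`tables`, `on_table`) double as an explicit, visually grounded inference path: each step points at concrete scene-graph nodes rather than latent features.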


