SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

12/16/2021
by Zhecan Wang, et al.

Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as strong reasoning ability. Recently, multimodal Transformers have made great progress on the task of Visual Commonsense Reasoning (VCR) by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects, which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs in commonsense reasoning. To exploit the scene graph structure at the model level, we propose a multihop graph transformer that regularizes attention interactions among hops. For pre-training, we propose a scene-graph-aware method that leverages structural knowledge extracted from the visual scene graph. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs from textual annotations in a weakly-supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost over state-of-the-art methods and demonstrate the efficacy of each proposed component.
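The core idea behind the multihop graph transformer is to constrain which tokens may attend to each other based on scene-graph connectivity: a node can only attend to nodes within a bounded number of hops. A minimal sketch of this idea, restricting scaled dot-product attention to k-hop scene-graph neighbors (the function names and the NumPy setup are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def khop_mask(adj, k):
    """Boolean mask allowing attention only between scene-graph nodes
    within k hops. `adj` is an (n, n) 0/1 adjacency matrix; self-loops
    are added so every node can always attend to itself."""
    n = adj.shape[0]
    a = adj + np.eye(n, dtype=adj.dtype)          # add self-loops
    reach = np.linalg.matrix_power(a, k)          # (I + A)^k counts <=k-hop walks
    return reach > 0

def masked_attention(q, k_mat, v, mask):
    """Scaled dot-product attention with disallowed node pairs
    masked out (set to -inf before the softmax)."""
    d = q.shape[-1]
    scores = q @ k_mat.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)      # block pairs beyond k hops
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

In a full model this mask would be applied per attention head inside each transformer layer, with a different hop limit per layer or head to regularize how information propagates along the scene graph.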


Related research

01/07/2020
Bridging Knowledge Graphs to Generate Scene Graphs
Scene graphs are powerful representations that encode images into their ...

06/17/2020
Learning Visual Commonsense for Robust Scene Graph Generation
Scene graph generation models understand the scene through object and pr...

01/30/2023
Pseudo 3D Perception Transformer with Multi-level Confidence Optimization for Visual Commonsense Reasoning
A framework performing Visual Commonsense Reasoning (VCR) needs to choose...

02/17/2023
CK-Transformer: Commonsense Knowledge Enhanced Transformers for Referring Expression Comprehension
The task of multimodal referring expression comprehension (REC), aiming ...

11/10/2022
Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense
Visual commonsense understanding requires Vision Language (VL) models to...

04/17/2022
Attention Mechanism based Cognition-level Scene Understanding
Given a question-image input, the Visual Commonsense Reasoning (VCR) mod...

11/10/2015
From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge
In this paper we propose the construction of linguistic descriptions of ...
