Visual Commonsense R-CNN

by Tan Wang, et al.
Singapore Management University
Nanyang Technological University

We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., from Faster R-CNN), the proxy training objective of VC R-CNN is, like that of other unsupervised feature learning methods (e.g., word2vec), to predict the contextual objects of a region. However, they are fundamentally different: VC R-CNN makes its prediction via causal intervention, P(Y|do(X)), while the others use the conventional likelihood, P(Y|X). This is also the core reason why VC R-CNN can learn "sense-making" knowledge, such as a chair can be sat on, rather than merely "common" co-occurrences, such as a chair is likely to exist if a table is observed. We extensively apply VC R-CNN features in prevailing models for three popular tasks: Image Captioning, VQA, and VCR, and observe consistent performance boosts across all methods and tasks, achieving many new state-of-the-art results. Code and features are available at
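The contrast between the conventional likelihood P(Y|X) and the intervention P(Y|do(X)) can be illustrated with a toy backdoor adjustment over a single discrete confounder. All numbers below are hypothetical, chosen only for illustration; VC R-CNN itself applies the adjustment over learned region features and a confounder dictionary, not a hand-built table like this.

```python
# Toy backdoor adjustment (hypothetical numbers):
#   P(y | do(x)) = sum_z P(y | x, z) * P(z)        -- intervention
#   P(y | x)     = sum_z P(y | x, z) * P(z | x)    -- conditioning
# z is a binary confounder (think: scene context that drives both
# "table observed" x and "chair present" y).

P_z = {0: 0.5, 1: 0.5}                      # prior over the confounder
P_z_given_x = {0: 0.2, 1: 0.8}              # z is skewed once we observe x = 1
P_y_given_xz = {(1, 0): 0.3, (1, 1): 0.9}   # effect of x = 1 on y under each z

# Conventional likelihood: the confounder distribution shifts with x,
# so spurious co-occurrence leaks into the estimate.
p_y_given_x = sum(P_y_given_xz[(1, z)] * P_z_given_x[z] for z in (0, 1))

# Causal intervention: cut the z -> x edge and keep the prior P(z).
p_y_do_x = sum(P_y_given_xz[(1, z)] * P_z[z] for z in (0, 1))

print(p_y_given_x)  # 0.78 -- inflated by the confounder
print(p_y_do_x)     # 0.60 -- the interventional estimate
```

The gap between the two numbers is exactly the "common but not sense-making" co-occurrence signal that conditioning absorbs and intervention removes.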




Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Top-down visual attention mechanisms have been used extensively in image...

Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning

Humans tend to decompose a sentence into different parts like sth do sth...

In Defense of Grid Features for Visual Question Answering

Popularized as 'bottom-up' attention, bounding box (or region) based vis...

GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features

Current state-of-the-art methods for image captioning employ region-base...

A-CAP: Anticipation Captioning with Commonsense Knowledge

Humans possess the capacity to reason about the future based on a sparse...

Visual Commonsense-aware Representation Network for Video Captioning

Generating consecutive descriptions for videos, i.e., Video Captioning, ...

Image Captioning In the Transformer Age

Image Captioning (IC) has achieved astonishing developments by incorpora...

Code Repositories


[CVPR 2020] The official PyTorch implementation of "Visual Commonsense R-CNN"

Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | Python3 | PyTorch | CNNs | Causality | Reasoning | LSTMs | Transformers | Multi-Head Self Attention | Published in IEEE Winter Conference on Applications of Computer Vision (WACV) 2021

