VD-BERT: A Unified Vision and Dialog Transformer with BERT

04/28/2020
by Shafiq Joty, et al.

Visual dialog is a challenging vision-language task in which a dialog agent must answer a series of questions by reasoning over the image content and the dialog history. Prior work has mostly focused on various attention mechanisms to model these intricate interactions. By contrast, in this work we propose VD-BERT, a simple yet effective unified vision-dialog Transformer framework that leverages pretrained BERT language models for Visual Dialog tasks. The model is unified in that (1) it captures all the interactions between the image and the multi-turn dialog using a single-stream Transformer encoder, and (2) it supports both answer ranking and answer generation seamlessly through the same architecture. More crucially, we adapt BERT for the effective fusion of vision and dialog contents via visually grounded training. Without pretraining on external vision-language data, our model achieves a new state of the art, taking the top position in both single-model and ensemble settings (74.54 and 75.35 NDCG scores, respectively) on the Visual Dialog benchmark leaderboard. We release the code and pretrained models to replicate the results from this paper at https://github.com/yuewang-cuhk/VD-BERT.
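
For readers unfamiliar with the NDCG metric quoted above, the following is a minimal sketch of the textbook normalized discounted cumulative gain with a log2 rank discount. It is not taken from the VD-BERT code or the benchmark's evaluation script: the function names `dcg` and `ndcg`, the optional cutoff `k`, and the toy scores are assumptions made for illustration, and the leaderboard's exact variant (for example its choice of cutoff) may differ. Leaderboard figures such as 74.54 are presumably this quantity scaled to a percentage.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain of relevances listed in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(scores: Sequence[float],
         relevances: Sequence[float],
         k: Optional[int] = None) -> float:
    """Textbook NDCG@k (illustrative; not the benchmark's official script).

    scores:     model score per candidate answer (higher = ranked earlier)
    relevances: ground-truth relevance per candidate answer
    k:          cutoff; defaults to the number of candidates (an assumption)
    """
    if len(scores) != len(relevances):
        raise ValueError("scores and relevances must have the same length")
    if k is None:
        k = len(scores)
    # Rank candidates by descending model score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order[:k]]
    # The ideal ranking sorts candidates by true relevance.
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example with three candidate answers (made-up numbers).
perfect = ndcg([0.9, 0.1, 0.5], [1.0, 0.0, 0.5])
worse = ndcg([0.1, 0.9, 0.5], [1.0, 0.0, 0.5])
print(perfect, worse)
```

A ranking that orders candidates exactly by their true relevance scores 1.0; any other ordering scores lower, which is why the metric rewards placing all relevant answers near the top rather than only the single ground-truth answer.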

research
03/06/2022

Modeling Coreference Relations in Visual Dialog

Visual dialog is a vision-language task where an agent needs to answer a...
research
12/05/2019

Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline

Prior work in visual dialog has focused on training deep neural models o...
research
11/23/2022

Unified Multimodal Model with Unlikelihood Training for Visual Dialog

The task of visual dialog requires a multimodal chatbot to answer sequen...
research
10/28/2019

A Simple but Effective BERT Model for Dialog State Tracking on Resource-Limited Systems

In a task-oriented dialog system, the goal of dialog state tracking (DST...
research
11/26/2019

Efficient Attention Mechanism for Handling All the Interactions between Many Inputs with Application to Visual Dialog

It has been a primary concern in recent studies of vision and language t...
research
08/22/2023

Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation

This report details the methods of the winning entry of the AVDN Challen...
research
05/24/2021

Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation

GuessWhat?! is a two-player visual dialog guessing game where player A a...
