Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?

10/21/2022
by Mitja Nikolaus, et al.

Recent advances in vision-and-language modeling have seen the development of Transformer architectures that achieve remarkable performance on multimodal reasoning tasks. Yet the exact capabilities of these black-box models are still poorly understood. While much previous work has focused on studying their ability to learn meaning at the word level, their ability to track syntactic dependencies between words has received less attention. We take a first step toward closing this gap by creating a new multimodal task targeted at evaluating understanding of predicate-noun dependencies in a controlled setup. We evaluate a range of state-of-the-art models and find that their performance on the task varies considerably, with some models performing relatively well and others at chance level. In an effort to explain this variability, our analyses indicate that the quality (and not only the sheer quantity) of pretraining data is essential. Additionally, the best-performing models leverage fine-grained multimodal pretraining objectives in addition to the standard image-text matching objectives. This study highlights that targeted and controlled evaluations are a crucial step toward a precise and rigorous test of the multimodal knowledge of vision-and-language models.
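To make the evaluation setup concrete: a controlled test of predicate-noun dependencies typically scores a model on minimal sentence pairs where both the noun and the predicate are grounded in the image, but only one pairing is correct (e.g., an image showing a cat sitting and a dog standing, contrasted with "a dog is sitting"). The sketch below illustrates the general idea using CLIP's image-text matching score as a stand-in scorer; this is an assumption for illustration, not the paper's exact task format or model set, and the image file and sentence pair are hypothetical.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in image-text matching model (the paper evaluates other V&L Transformers).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prefers_correct(image: Image.Image, correct: str, foil: str) -> bool:
    """Return True if the model scores the correct sentence above the foil."""
    inputs = processor(text=[correct, foil], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, 2)
    return (logits[0, 0] > logits[0, 1]).item()

# Hypothetical minimal pair: both "dog" and "sitting" appear in the scene,
# so word-level cues alone cannot separate the sentences; only the
# predicate-noun dependency differs.
image = Image.open("cat_sitting_dog_standing.jpg")
print(prefers_correct(image, "A cat is sitting.", "A dog is sitting."))

Accuracy on such pairs stays at chance for a model that matches words to the image independently, which is what makes the design a targeted probe of dependency understanding rather than of word-level grounding.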


Related research

06/02/2022 · VL-BEiT: Generative Vision-Language Pretraining
We introduce a vision-language foundation model called VL-BEiT, which is...

12/09/2021 · MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning
Large-scale pretraining is fast becoming the norm in Vision-Language (VL...

04/25/2019 · Probing What Different NLP Tasks Teach Machines about Function Word Comprehension
We introduce a set of nine challenge tasks that test for the understandi...

01/31/2021 · Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Recently multimodal transformer models have gained popularity because th...

10/12/2022 · Foundation Transformers
A big convergence of model architectures across language, vision, speech...

05/23/2023 · Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Recent work in vision-and-language pretraining has investigated supervis...

08/31/2018 · What do RNN Language Models Learn about Filler-Gap Dependencies?
RNN language models have achieved state-of-the-art perplexity results an...
