Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?

08/24/2023
by   Fei Wang, et al.

The multimedia community has shown significant interest in perceiving and representing the physical world with multimodal pretrained neural network models, and among these, vision-language pretraining (VLP) is currently the most captivating topic. However, few endeavors have been dedicated to exploring 1) whether essential linguistic knowledge (e.g., semantics and syntax) can be extracted during VLP, and 2) how such linguistic knowledge impacts or enhances multimodal alignment. In response, here we aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment. Specifically, we design and release SNARE, the first large-scale multimodal alignment probing benchmark, to detect vital linguistic components, e.g., lexical, semantic, and syntactic knowledge. It contains four tasks: Semantic structure, Negation logic, Attribute ownership, and Relationship composition. Based on our proposed probing benchmark, our holistic analyses of five advanced VLP models illustrate that these models: i) are insensitive to complex syntactic structures and rely on content words for sentence comprehension; ii) demonstrate limited comprehension of sentences combined with negations; iii) face challenges in determining the presence of actions or spatial relationships in visual information, and struggle to verify the correctness of triple combinations. We make our benchmark and code available at <https://github.com/WangFei-2019/SNARE/>.
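The probing setup described in the abstract can be illustrated with a minimal sketch: score one image against a correct caption and a minimally perturbed variant (here, a negated one, in the spirit of the Negation logic task) with an off-the-shelf VLP model, and check which caption the model prefers. The CLIP checkpoint, image path, and caption pair below are illustrative assumptions, not the authors' evaluation code; the actual benchmark is in the linked repository.

```python
# Minimal sketch of a caption-pair probe (not the SNARE evaluation code):
# compare a VLP model's alignment score for a correct caption versus a
# negated variant of it, for a single image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"      # illustrative checkpoint
model = CLIPModel.from_pretrained(model_name).eval()
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.open("example.jpg")                # hypothetical image file
captions = [
    "A dog is sitting on the sofa.",             # correct caption
    "A dog is not sitting on the sofa.",         # negation-logic perturbation
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # logits_per_image[0] holds the image-text similarity for both captions
    logits = model(**inputs).logits_per_image[0]

# A probe instance counts as passed when the correct caption outscores the
# perturbed one; benchmark accuracy aggregates this over many such pairs.
print("correct:", logits[0].item(), "negated:", logits[1].item())
print("model prefers correct caption:", bool(logits[0] > logits[1]))
```

Aggregating this pairwise preference over many images and perturbation types yields the kind of per-task accuracy a probing benchmark of this sort reports.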

Related research

Finding Structural Knowledge in Multimodal-BERT (03/17/2022)
In this work, we investigate the knowledge learned in the embeddings of ...

Does Vision-and-Language Pretraining Improve Lexical Grounding? (09/21/2021)
Linguistic representations derived from text alone have been criticized ...

Contrastive Visual-Linguistic Pretraining (07/26/2020)
Several multi-modality representation learning approaches such as LXMERT...

Predicting emergent linguistic compositions through time: Syntactic frame extension via multimodal chaining (09/10/2021)
Natural language relies on a finite lexicon to express an unbounded set ...

Lexical representation explains cortical entrainment during speech comprehension (06/18/2017)
Results from a recent neuroimaging study on spoken sentence comprehensio...

Tiny LVLM-eHub: Early Multimodal Experiments with Bard (08/07/2023)
Recent advancements in Large Vision-Language Models (LVLMs) have demonst...

A Knowledge Graph Embeddings based Approach for Author Name Disambiguation using Literals (01/24/2022)
Scholarly data is growing continuously containing information about the ...
