Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding

by Xiaonan Lu, et al.

Pre-trained vision language foundation models (VLFMs) have recently achieved remarkable performance on many tasks. However, these models tend to have strong single-image understanding but lack the ability to understand multiple images. They therefore cannot be applied directly to image change understanding (ICU), which requires a model to capture the actual changes between multiple images and describe them in language. In this paper, we find that existing VLFMs perform poorly when applied directly to ICU for two reasons: (1) VLFMs generally learn a global representation of a single image, whereas ICU requires capturing nuances between multiple images. (2) The ICU performance of VLFMs is significantly affected by viewpoint variations, because the relationships between objects are altered when the viewpoint changes. To address these problems, we propose a Viewpoint Integration and Registration method. Concretely, we introduce a fused adapter image encoder that fine-tunes pre-trained encoders by inserting specially designed trainable adapters and fused adapters, so as to effectively capture nuances between image pairs. Additionally, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the performance degradation caused by viewpoint variations in the visual and semantic spaces, respectively. Experimental results on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance on all metrics.
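
To make the idea of inserting trainable adapters into a frozen pre-trained encoder more concrete, the following is a minimal sketch of a generic residual bottleneck adapter in plain Python. It is not the paper's implementation: the class name `BottleneckAdapter`, the ReLU nonlinearity, the layer sizes, and the zero initialization of the up-projection are all assumptions made for illustration, and the fused adapter is not shown because the abstract does not specify its design.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def relu(vector):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in vector]


class BottleneckAdapter:
    """Hypothetical residual adapter: x + W_up * relu(W_down * x).

    In adapter-style fine-tuning, only these small matrices would be
    trained while the surrounding pre-trained encoder stays frozen.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: dim -> bottleneck, small random init (assumed).
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection: bottleneck -> dim, zero init (assumed) so the
        # adapter starts as an identity mapping.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = relu(matvec(self.w_down, x))
        delta = matvec(self.w_up, hidden)
        # Residual connection keeps the frozen encoder's feature intact.
        return [xi + di for xi, di in zip(x, delta)]


# Usage: a freshly initialized adapter leaves a feature vector unchanged.
feature = [1.0, -2.0, 0.5, 3.0]
adapter = BottleneckAdapter(dim=4, bottleneck=2)
adapted = adapter(feature)
```

Because the up-projection starts at zero in this sketch, the adapted encoder initially reproduces the pre-trained features, and any change in behaviour comes only from training the adapter weights.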



