Exploring Effective Factors for Improving Visual In-Context Learning

by Yanpeng Sun, et al.

In-Context Learning (ICL) is the ability to understand a new task from a few demonstrations (a.k.a. a prompt) and make predictions on new inputs without tuning the model. While it has been widely studied in NLP, it remains a relatively new area of research in computer vision. To reveal the factors influencing the performance of visual in-context learning, this paper shows that prompt selection and prompt fusion are two major factors with a direct impact on inference performance. Prompt selection is the process of identifying the most appropriate prompt, or example, to help the model understand a new task; providing the model with relevant prompts helps it learn more effectively and efficiently. Prompt fusion involves combining knowledge from different positions within the large-scale visual model, allowing the model to leverage the diverse knowledge stored in different parts of the model to improve its performance on new tasks. Based on these findings, we propose a simple framework, prompt-SelF, for visual in-context learning. Specifically, we first use a pixel-level retrieval method to select a suitable prompt, then apply different prompt fusion methods to activate all the knowledge stored in the large-scale model, and finally ensemble the predictions obtained from the different prompt fusion methods to produce the final prediction. We conduct extensive experiments on single-object segmentation and detection tasks to demonstrate the effectiveness of prompt-SelF. Remarkably, prompt-SelF outperforms OSLSM-based meta-learning on 1-shot segmentation for the first time, which indicates the great potential of visual in-context learning. The source code and models will be available at <https://github.com/syp2ysy/prompt-SelF>.
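The two components described above, retrieval-based prompt selection and ensembling over multiple prompt fusion variants, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`select_prompt`, `ensemble_predictions`) are invented for this sketch, cosine similarity over flattened feature maps stands in for the pixel-level retrieval, and the inpainting model that actually produces per-arrangement predictions is assumed to exist elsewhere.

```python
import numpy as np

def select_prompt(query_feat: np.ndarray, candidate_feats: list) -> int:
    """Prompt selection sketch: return the index of the candidate whose
    flattened feature map is most cosine-similar to the query's.
    (Stand-in for the paper's pixel-level retrieval.)"""
    q = query_feat.ravel()
    q = q / np.linalg.norm(q)
    scores = [float(q @ (c.ravel() / np.linalg.norm(c.ravel())))
              for c in candidate_feats]
    return int(np.argmax(scores))

def ensemble_predictions(prob_maps: list, threshold: float = 0.5) -> np.ndarray:
    """Ensembling sketch: average the probability maps produced under
    different prompt fusion arrangements, then binarize into a mask."""
    return (np.mean(prob_maps, axis=0) > threshold).astype(np.uint8)

# Toy usage: pick the most similar prompt, then fuse two per-arrangement maps.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = [np.array([[0.0, 1.0], [1.0, 0.0]]),
              np.array([[2.0, 0.0], [0.0, 2.0]])]
best = select_prompt(query, candidates)          # index 1: same direction as query
mask = ensemble_predictions([np.array([1.0, 0.0]),
                             np.array([1.0, 1.0])])  # agreement survives, ties drop
```

In the paper's setting, each "arrangement" corresponds to a different spatial composition of the prompt pair and query on the input canvas of the large-scale model; only the averaging step is shown here.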




