Referring image segmentation aims to segment an object mentioned in natu...
Due to the limited scale and quality of video-text training corpus, most...
Referring image segmentation aims to segment an object referred to by na...
Large-scale image-text contrastive pre-training models, such as CLIP, ha...
Large pre-trained multimodal models have demonstrated significant succes...
In this paper, we propose a Vision-Audio-Language Omni-peRception pretra...
Multimodal representation learning has shown promising improvements on
v...
Compared with image scene parsing, video scene parsing introduces tempor...
Depth information matters in RGB-D semantic segmentation task for provid...
Most image captioning models are autoregressive, i.e. they generate each...