X-Learner: Learning Cross Sources and Tasks for Universal Visual Representation

by Yinan He, et al.

In computer vision, pre-training models via large-scale supervised learning has proven effective over the past few years. However, existing works mostly focus on learning from an individual task with a single data source (e.g., ImageNet for classification or COCO for detection). This restricted form limits their generalizability and usability, since it misses the vast semantic information available across diverse tasks and data sources. Here, we demonstrate that jointly learning from heterogeneous tasks and multiple data sources yields a universal visual representation, leading to better transfer results on various downstream tasks. Learning how to bridge the gaps among different tasks and data sources is therefore the key, yet it remains an open question. In this work, we propose a representation learning framework called X-Learner, which learns universal features for multiple vision tasks supervised by various sources, in two stages: 1) Expansion Stage: X-Learner learns task-specific features to alleviate task interference and enriches the representation via a reconciliation layer. 2) Squeeze Stage: X-Learner condenses the model to a reasonable size and learns a universal, generalizable representation for transferring to various tasks. Extensive experiments demonstrate that X-Learner achieves strong performance on different tasks without extra annotations, modalities or computational costs compared to existing representation learning methods. Notably, a single X-Learner model shows remarkable gains of 3.0% across 12 downstream datasets for classification, object detection and semantic segmentation.
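The expansion/squeeze idea can be illustrated with a toy sketch. The code below is an illustrative stand-in, not the paper's architecture: the linear maps, the residual form of the hypothetical "reconciliation" branch, and the weight-averaging used as the squeeze step are all assumptions made for clarity (the paper condenses the model during training rather than by averaging weights).

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_FEAT = 32, 16
TASKS = ["classification", "detection", "segmentation"]

# Shared backbone: a single linear map standing in for a deep network.
W_backbone = rng.standard_normal((D_IN, D_FEAT)) * 0.1

# Expansion stage: one hypothetical reconciliation branch per task, so each
# task can refine the shared feature without interfering with the others.
W_task = {t: rng.standard_normal((D_FEAT, D_FEAT)) * 0.1 for t in TASKS}

def expansion_forward(x):
    """Return a task-specific feature for every task (expansion stage)."""
    h = x @ W_backbone                             # shared representation
    return {t: h + h @ W_task[t] for t in TASKS}   # residual task branches

# Squeeze stage: condense the task branches into one universal head.
# Averaging the branch weights is a naive stand-in for the paper's
# condensation procedure.
W_universal = sum(W_task.values()) / len(W_task)

def squeeze_forward(x):
    """Return a single universal representation for downstream transfer."""
    h = x @ W_backbone
    return h + h @ W_universal

x = rng.standard_normal((4, D_IN))
task_feats = expansion_forward(x)   # one (4, 16) feature map per task
z = squeeze_forward(x)              # one condensed (4, 16) feature map
```

Because every map here is linear, the squeezed feature equals the average of the task-specific features, which makes the condensation step easy to sanity-check in this toy setting.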



