SLIP: Self-supervision meets Language-Image Pre-training

12/23/2021
by Norman Mu, et al.

Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising performance on a wide variety of benchmarks. In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. After pre-training with Vision Transformers, we thoroughly evaluate representation quality and compare performance to both CLIP and self-supervised learning under three distinct settings: zero-shot transfer, linear classification, and end-to-end finetuning. Across ImageNet and a battery of additional datasets, we find that SLIP improves accuracy by a large margin. We validate our results further with experiments on different model sizes, training schedules, and pre-training datasets. Our findings show that SLIP enjoys the best of both worlds: better performance than self-supervision (+8.1%) and language supervision (+5.2%).


Related research

research · 01/20/2022
Revisiting Weakly Supervised Pre-Training of Visual Perception Models
Model pre-training is a cornerstone of modern visual recognition systems...

research · 09/02/2023
Self-Supervised Video Transformers for Isolated Sign Language Recognition
This paper presents an in-depth analysis of various self-supervision met...

research · 04/18/2020
Self-Supervised Representation Learning on Document Images
This work analyses the impact of self-supervised pre-training on documen...

research · 04/14/2022
DeiT III: Revenge of the ViT
A Vision Transformer (ViT) is a simple neural architecture amenable to s...

research · 03/09/2023
Rethinking Self-Supervised Visual Representation Learning in Pre-training for 3D Human Pose and Shape Estimation
Recently, a few self-supervised representation learning (SSL) methods ha...

research · 07/19/2022
Self-Supervision Can Be a Good Few-Shot Learner
Existing few-shot learning (FSL) methods rely on training with a large l...

research · 11/04/2021
Generalized Radiograph Representation Learning via Cross-supervision between Images and Free-text Radiology Reports
Pre-training lays the foundation for recent successes in radiograph anal...
