Self-supervised learning of visual features through embedding images into text topic spaces

05/24/2017
by Lluis Gómez, et al.

End-to-end training from scratch of current deep architectures for new computer vision problems requires ImageNet-scale datasets, which are not always available. In this paper we present a method that takes advantage of freely available multi-modal content to train computer vision algorithms without human supervision. We propose self-supervised learning of visual features by mining a large-scale corpus of multi-modal (text and image) documents. We show that discriminative visual features can be learnt efficiently by training a CNN to predict the semantic context in which a particular image is most likely to appear as an illustration. For this we leverage the hidden semantic structures discovered in the text corpus with a well-known topic modeling technique. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally supervised approaches.
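The training signal described in the abstract can be illustrated with a minimal sketch. It assumes that a topic model has already assigned each document a probability distribution over topics, and that the CNN outputs one logit per topic for an image taken from that document. The loss used here, a softmax cross-entropy against the soft topic distribution, is one plausible choice made for illustration; the abstract does not state which loss the authors use. All names and values (`topic_probs`, `cnn_logits`, the three-topic setup) are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_cross_entropy(logits, topic_probs):
    """Cross-entropy between the text-derived topic distribution (the target)
    and the distribution predicted from the image."""
    predicted = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(topic_probs, predicted) if t > 0)


def loss_gradient(logits, topic_probs):
    """Gradient of the loss with respect to the logits.
    Equals softmax(logits) - topic_probs when topic_probs sums to 1."""
    return [p - t for p, t in zip(softmax(logits), topic_probs)]


# Hypothetical example: the text of an article is assigned to 3 topics
# by a topic model, and the CNN produces 3 logits for one of its images.
topic_probs = [0.7, 0.2, 0.1]   # assumed output of the topic model
cnn_logits = [2.0, 0.5, -1.0]   # assumed output of the CNN

loss = soft_target_cross_entropy(cnn_logits, topic_probs)
grad = loss_gradient(cnn_logits, topic_probs)
print("loss:", loss)
print("gradient w.r.t. logits:", grad)
```

No human labels appear anywhere in this sketch: the target comes entirely from the text that accompanies the image, which is what makes the setup self-supervised. The loss reaches its minimum when the predicted distribution matches the topic distribution, at which point the gradient vanishes.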


