Semi-supervised Multimodal Representation Learning through a Global Workspace

06/27/2023
by Benjamin Devillers et al.

Recent deep learning models can efficiently combine inputs from different modalities (e.g., images and text) and learn to align their latent representations, or to translate signals from one domain to another (as in image captioning, or text-to-image generation). However, current approaches mainly rely on brute-force supervised training over large multimodal datasets. In contrast, humans (and other animals) can learn useful multimodal representations from only sparse experience with matched cross-modal data. Here we evaluate the capabilities of a neural network architecture inspired by the cognitive notion of a "Global Workspace": a shared representation for two (or more) input modalities. Each modality is processed by a specialized system (pretrained on unimodal data, and subsequently frozen). The corresponding latent representations are then encoded to and decoded from a single shared workspace. Importantly, this architecture is amenable to self-supervised training via cycle-consistency: encoding-decoding sequences should approximate the identity function. For various pairings of vision-language modalities and across two datasets of varying complexity, we show that such an architecture can be trained to align and translate between two modalities with very little need for matched data (from 4 to 7 times less than a fully supervised approach). The global workspace representation can be used advantageously for downstream classification tasks and for robust transfer learning. Ablation studies reveal that both the shared workspace and the self-supervised cycle-consistency training are critical to the system's performance.
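To make the described training setup concrete, below is a minimal PyTorch sketch of how such a global workspace could be wired and trained. This is an illustration under stated assumptions, not the authors' implementation: the module names, latent dimensions, plain-MLP encoders/decoders, and equal loss weighting are all hypothetical, and the frozen pretrained unimodal encoders are assumed to already produce latent vectors `v` and `t` for the vision and text modalities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes: frozen vision latent, frozen text latent, workspace.
LATENT_V, LATENT_T, GW_DIM = 512, 384, 256

def mlp(d_in, d_out, hidden=256):
    # Simple two-layer encoder/decoder; the real architecture may differ.
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, d_out))

class GlobalWorkspace(nn.Module):
    def __init__(self):
        super().__init__()
        # One encoder/decoder pair per modality, mapping the frozen
        # unimodal latent to and from the single shared workspace.
        self.enc = nn.ModuleDict({"v": mlp(LATENT_V, GW_DIM), "t": mlp(LATENT_T, GW_DIM)})
        self.dec = nn.ModuleDict({"v": mlp(GW_DIM, LATENT_V), "t": mlp(GW_DIM, LATENT_T)})

    def translate(self, x, src, dst):
        # Route one modality's latent through the workspace to the other.
        return self.dec[dst](self.enc[src](x))

    def demi_cycle(self, x, m):
        # Encode then decode within one modality: should approximate identity.
        return self.dec[m](self.enc[m](x))

    def full_cycle(self, x, src, dst):
        # Translate to the other modality and back: should approximate identity.
        return self.translate(self.translate(x, src, dst), dst, src)

def loss_fn(gw, v, t, paired_v=None, paired_t=None):
    # Self-supervised cycle-consistency terms use *unpaired* latents v, t.
    l_demi = F.mse_loss(gw.demi_cycle(v, "v"), v) + F.mse_loss(gw.demi_cycle(t, "t"), t)
    l_full = F.mse_loss(gw.full_cycle(v, "v", "t"), v) + F.mse_loss(gw.full_cycle(t, "t", "v"), t)
    loss = l_demi + l_full
    if paired_v is not None:
        # Scarce matched pairs add supervised alignment and translation terms.
        loss = loss + F.mse_loss(gw.enc["v"](paired_v), gw.enc["t"](paired_t))
        loss = loss + F.mse_loss(gw.translate(paired_v, "v", "t"), paired_t)
        loss = loss + F.mse_loss(gw.translate(paired_t, "t", "v"), paired_v)
    return loss
```

In this sketch, the full-cycle term is what lets unmatched data shape the translation pathways: a vision latent must survive a round trip through the text side of the workspace, which constrains both translation directions even when no matched pair is available, leaving the small supervised set to pin down the correspondence.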

