MOSAIC: Learning Unified Multi-Sensory Object Property Representations for Robot Perception

by Gyan Tatiya, et al.

A holistic understanding of object properties across diverse sensory modalities (e.g., visual, audio, and haptic) is essential for tasks ranging from object categorization to complex manipulation. Drawing inspiration from cognitive science studies that emphasize the significance of multi-sensory integration in human perception, we introduce MOSAIC (Multi-modal Object property learning with Self-Attention and Integrated Comprehension), a novel framework designed to facilitate the learning of unified multi-sensory object property representations. While visual information undeniably plays a prominent role, many fundamental object properties extend beyond the visual domain to attributes like texture, mass distribution, or sound, which significantly influence how we interact with objects. MOSAIC leverages this insight by distilling knowledge from the large pre-trained Contrastive Language-Image Pre-training (CLIP) model, aligning representations not only for vision but also for the haptic and auditory modalities. Through extensive experiments on a dataset in which a humanoid robot interacts with 100 objects across 10 exploratory behaviors, we demonstrate the versatility of MOSAIC in two task families: object categorization and object fetching. Our results underscore the efficacy of MOSAIC's unified representations, which achieve competitive performance in category recognition with a simple linear probe and excel in the object-fetching task under zero-shot transfer conditions. This work pioneers the application of CLIP-based sensory grounding in robotics, promising a significant leap in multi-sensory perception capabilities for autonomous systems. We have released the code, datasets, and additional results:
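The core idea of the distillation step described above can be illustrated with a minimal sketch: a small per-modality student encoder (here, a bare linear map standing in for MOSAIC's actual architecture) is trained so that its embedding of a non-visual signal moves toward the frozen CLIP embedding of the same object. Everything below is an assumption for illustration only: `LinearEncoder`, the random `teacher` vector standing in for a CLIP embedding, the raw `haptic_x` feature vector, and the plain L2 distillation loss are placeholders, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8  # real CLIP embeddings are 512- or 768-dim; kept tiny here


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


class LinearEncoder:
    """Stand-in student encoder for one non-visual modality (e.g., haptic).

    MOSAIC's actual encoders are not reproduced here; this is a single
    linear projection trained by plain gradient descent.
    """

    def __init__(self, in_dim, out_dim):
        self.W = rng.normal(scale=0.1, size=(out_dim, in_dim))

    def forward(self, x):
        return self.W @ x

    def distill_step(self, x, teacher, lr=0.01):
        # One gradient step on the L2 distillation loss ||W x - teacher||^2,
        # pulling the student embedding toward the frozen teacher embedding.
        z = self.forward(x)
        grad = 2.0 * np.outer(z - teacher, x)
        self.W -= lr * grad
        return float(np.sum((z - teacher) ** 2))


# Frozen "CLIP" embedding of one object (placeholder random unit vector).
teacher = rng.normal(size=EMBED_DIM)
teacher /= np.linalg.norm(teacher)

# Raw haptic feature vector for the same object (placeholder).
haptic_x = rng.normal(size=16)

enc = LinearEncoder(16, EMBED_DIM)
losses = [enc.distill_step(haptic_x, teacher) for _ in range(100)]
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
print(f"cosine to teacher: {cosine(enc.forward(haptic_x), teacher):.3f}")
```

Once all modality encoders share CLIP's embedding space this way, the zero-shot fetch setting reduces to comparing a CLIP text embedding of the request against the robot's multi-sensory object embeddings by cosine similarity.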




