CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding

by   Mohamed Afham, et al.

Manual annotation of large-scale point cloud dataset for varying tasks such as 3D object classification, segmentation and detection is often laborious owing to the irregular structure of point clouds. Self-supervised learning, which operates without any human labeling, is a promising approach to address this issue. We observe in the real world that humans are capable of mapping the visual concepts learnt from 2D images to understand the 3D world. Encouraged by this insight, we propose CrossPoint, a simple cross-modal contrastive learning approach to learn transferable 3D point cloud representations. It enables a 3D-2D correspondence of objects by maximizing agreement between point clouds and the corresponding rendered 2D image in the invariant space, while encouraging invariance to transformations in the point cloud modality. Our joint training objective combines the feature correspondences within and across modalities, thus ensembles a rich learning signal from both 3D point cloud and 2D image modalities in a self-supervised fashion. Experimental results show that our approach outperforms the previous unsupervised learning methods on a diverse range of downstream tasks including 3D object classification and segmentation. Further, the ablation studies validate the potency of our approach for a better point cloud understanding. Code and pretrained models are available at


PointCMC: Cross-Modal Multi-Scale Correspondences Learning for Point Cloud Understanding

Some self-supervised cross-modal learning approaches have recently demon...

Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training

Masked Autoencoders (MAE) have shown promising performance in self-super...

CLR-GAM: Contrastive Point Cloud Learning with Guided Augmentation and Feature Mapping

Point cloud data plays an essential role in robotics and self-driving ap...

ViPFormer: Efficient Vision-and-Pointcloud Transformer for Unsupervised Pointcloud Understanding

Recently, a growing number of work design unsupervised paradigms for poi...

Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

Current point-cloud detection methods have difficulty detecting the open...

Self-supervised Modal and View Invariant Feature Learning

Most of the existing self-supervised feature learning methods for 3D dat...

P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding

Self-supervised representation learning is a critical problem in compute...

Please sign up or login with your details

Forgot password? Click here to reset