CLIP^2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data

by Yihan Zeng, et al.

Contrastive Language-Image Pre-training, benefiting from large-scale unlabeled text-image pairs, has demonstrated great performance in open-world vision understanding tasks. However, due to the scarcity of text-3D data pairs, adapting the success of 2D Vision-Language Models (VLM) to the 3D space remains an open problem. Existing works that leverage VLM for 3D understanding generally resort to constructing intermediate 2D representations for the 3D data, at the cost of losing 3D geometry information. To take a step toward open-world 3D vision understanding, we propose Contrastive Language-Image-Point Cloud Pretraining (CLIP^2) to directly learn transferable 3D point cloud representations in realistic scenarios with a novel proxy alignment mechanism. Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios. On top of that, we propose a cross-modal contrastive objective to learn point cloud representations aligned at both the semantic and instance level. Experimental results on both indoor and outdoor scenarios show that our learned 3D representation has great transfer ability in downstream tasks, including zero-shot and few-shot 3D recognition, outperforming state-of-the-art methods by large margins. Furthermore, we analyze the capability of different representations in real scenarios and present an optional ensemble scheme.
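The cross-modal contrastive objective described above can be sketched with a generic InfoNCE-style loss over aligned text-image-point proxy triplets. This is a minimal illustration, not the paper's exact formulation: the pairing of the point-cloud branch against the text and image branches, the temperature value, and the function names (`info_nce`, `clip2_objective`) are all assumptions for demonstration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce(anchor, positive, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of instance-aligned embeddings.

    Row i of `anchor` and row i of `positive` are assumed to describe the
    same instance (a matched proxy pair); all other rows act as negatives.
    """
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    logits = a @ p.T / temperature          # (N, N) cosine similarities
    labels = np.arange(len(a))              # i-th anchor matches i-th positive

    def xent(lg, lb):
        # numerically plain cross-entropy over rows
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(lb)), lb].mean()

    # average the anchor->positive and positive->anchor directions
    return 0.5 * (xent(logits, labels) + xent(logits.T, labels))

def clip2_objective(text_emb, image_emb, point_emb, temperature=0.07):
    """Hypothetical total loss: contrast point features against both
    text and image features of the same instance-level proxies."""
    return (info_nce(point_emb, text_emb, temperature)
            + info_nce(point_emb, image_emb, temperature))
```

Intuitively, minimizing this loss pulls the point-cloud embedding of each instance toward the text and image embeddings of its own proxy while pushing it away from all other instances in the batch, which is how semantic and instance-level alignment emerges.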




Joint Representation Learning for Text and 3D Point Cloud

Recent advancements in vision-language pre-training (e.g. CLIP) have sho...

P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding

Self-supervised representation learning is a critical problem in compute...

Frozen CLIP Model is An Efficient Point Cloud Backbone

The pretraining-finetuning paradigm has demonstrated great success in NL...

OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding

We introduce OpenShape, a method for learning multi-modal joint represen...

Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

Current point-cloud detection methods have difficulty detecting the open...

TQ-Net: Mixed Contrastive Representation Learning For Heterogeneous Test Questions

Recently, more and more people study online for the convenience of acces...

Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks

Despite the rapid advancement of unsupervised learning in visual represe...
