Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

by   Yuheng Lu, et al.

Current point-cloud detection methods have difficulty detecting the open-vocabulary objects in the real world, due to their limited generalization capability. Moreover, it is extremely laborious and expensive to collect and fully annotate a point-cloud detection dataset with numerous classes of objects, leading to the limited classes of existing point-cloud datasets and hindering the model to learn general representations to achieve open-vocabulary point-cloud detection. As far as we know, we are the first to study the problem of open-vocabulary 3D point-cloud detection. Instead of seeking a point-cloud dataset with full labels, we resort to ImageNet1K to broaden the vocabulary of the point-cloud detector. We propose OV-3DETIC, an Open-Vocabulary 3D DETector using Image-level Class supervision. Specifically, we take advantage of two modalities, the image modality for recognition and the point-cloud modality for localization, to generate pseudo labels for unseen classes. Then we propose a novel debiased cross-modal contrastive learning method to transfer the knowledge from image modality to point-cloud modality during training. Without hurting the latency during inference, OV-3DETIC makes the point-cloud detector capable of achieving open-vocabulary detection. Extensive experiments demonstrate that the proposed OV-3DETIC achieves at least 10.77 improvement (absolute value) and 9.56 wide range of baselines on the SUN-RGBD dataset and ScanNet dataset, respectively. Besides, we conduct sufficient experiments to shed light on why the proposed OV-3DETIC works.


Open-Vocabulary Point-Cloud Object Detection without 3D Annotation

The goal of open-vocabulary detection is to identify novel objects based...

CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding

Manual annotation of large-scale point cloud dataset for varying tasks s...

CLIP^2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data

Contrastive Language-Image Pre-training, benefiting from large-scale unl...

Open-Vocabulary Affordance Detection in 3D Point Clouds

Affordance detection is a challenging problem with a wide variety of rob...

Let Images Give You More:Point Cloud Cross-Modal Training for Shape Analysis

Although recent point cloud analysis achieves impressive progress, the p...

ViPFormer: Efficient Vision-and-Pointcloud Transformer for Unsupervised Pointcloud Understanding

Recently, a growing number of work design unsupervised paradigms for poi...

Cross-modal Learning of Graph Representations using Radar Point Cloud for Long-Range Gesture Recognition

Gesture recognition is one of the most intuitive ways of interaction and...

Please sign up or login with your details

Forgot password? Click here to reset