PointCLIP: Point Cloud Understanding by CLIP

12/04/2021
by Renrui Zhang, et al.

Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP), which learns to match images with their corresponding texts in open-vocabulary settings, has shown inspiring performance on 2D visual recognition. However, whether CLIP, pre-trained on large-scale 2D image-text pairs, can generalize to 3D recognition remains underexplored. In this paper, we show that such a setting is feasible by proposing PointCLIP, which aligns CLIP-encoded point clouds with 3D category texts. Specifically, we encode a point cloud by projecting it into multi-view depth maps without rendering, and aggregate the view-wise zero-shot predictions to transfer knowledge from 2D to 3D. On top of that, we design an inter-view adapter to better extract the global feature and to adaptively fuse the few-shot knowledge learned in 3D into the 2D pre-trained CLIP. By fine-tuning only this lightweight adapter in the few-shot setting, the performance of PointCLIP improves substantially. We also observe a complementary property between PointCLIP and classical 3D-supervised networks: with simple ensembling, PointCLIP boosts the baseline's performance and even surpasses state-of-the-art models. PointCLIP is therefore a promising alternative for effective point cloud understanding via CLIP at low resource cost and in low-data regimes. We conduct thorough experiments on the widely adopted ModelNet10 and ModelNet40 benchmarks and on the challenging ScanObjectNN to demonstrate the effectiveness of PointCLIP. The code is released at https://github.com/ZrrSkywalker/PointCLIP.
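
To make the zero-shot pipeline concrete, below is a minimal, hypothetical sketch of the path the abstract describes: project the point cloud into several depth maps without rendering, encode each view with CLIP's image encoder, and aggregate the view-wise logits against CLIP-encoded category texts. It assumes PyTorch and OpenAI's CLIP package; the projection routine, prompt template, and view set are illustrative simplifications (CLIP's input normalization is omitted), not the authors' released implementation, which lives in the linked repository.

import math
import torch
import clip  # OpenAI's CLIP: https://github.com/openai/CLIP

def rotate_y(points, angle):
    # Rotate an (N, 3) point cloud around the y-axis to obtain a new view.
    c, s = math.cos(angle), math.sin(angle)
    rot = torch.tensor([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return points @ rot.T

def project_to_depth_map(points, resolution=224):
    # Project an (N, 3) point cloud onto the image plane as a depth map,
    # with no mesh rendering: quantize x/y to pixel indices and keep the
    # nearest point per pixel (deterministic via scatter_reduce).
    pts = points - points.mean(dim=0)
    pts = pts / pts.abs().max()                       # normalize to [-1, 1]
    xy = ((pts[:, :2] + 1) / 2 * (resolution - 1)).long()
    depth = 1.0 - (pts[:, 2] + 1) / 2                 # nearer point -> brighter
    flat = torch.zeros(resolution * resolution)
    idx = xy[:, 1] * resolution + xy[:, 0]
    flat = flat.scatter_reduce(0, idx, depth, reduce="amax")
    img = flat.view(resolution, resolution)
    return img.unsqueeze(0).repeat(3, 1, 1)           # 3 channels for CLIP

@torch.no_grad()
def zero_shot_classify(points, class_names, num_views=6, device="cpu"):
    model, _ = clip.load("ViT-B/32", device=device)
    # Hypothetical prompt template; the paper's exact prompts may differ.
    prompts = clip.tokenize(
        [f"point cloud depth map of a {c}." for c in class_names]).to(device)
    text_feat = model.encode_text(prompts).float()
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    logits = 0.0
    for i in range(num_views):                        # equally spaced views
        view = rotate_y(points, 2 * math.pi * i / num_views)
        depth = project_to_depth_map(view).unsqueeze(0).to(device)
        img_feat = model.encode_image(depth).float()
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        logits = logits + img_feat @ text_feat.T      # aggregate over views
    return class_names[logits.argmax().item()]

This sketch covers only the zero-shot path; in the few-shot setting the paper additionally fine-tunes a lightweight inter-view adapter on top of the frozen view-wise CLIP features.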

Related research

10/03/2022
CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training
Pre-training across 3D vision and language remains under development bec...

04/20/2023
Multi-view Vision-Prompt Fusion Network: Can 2D Pre-trained Model Boost 3D Point Cloud Data-scarce Learning?
Point cloud based 3D deep model has wide applications in many applicatio...

12/04/2021
VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts
Contrastive Vision-Language Pre-training (CLIP) has drawn increasing att...

06/21/2023
NeuroCLIP: Neuromorphic Data Understanding by CLIP and SNN
Recently, the neuromorphic vision sensor has received more and more inte...

01/12/2023
CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP
Contrastive language-image pre-training (CLIP) achieves promising result...

08/06/2023
Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation
In recent years, 3D representation learning has turned to 2D vision-lang...

12/13/2022
Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders
Pre-training by numerous image data has become de-facto for robust 2D re...
