OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

by Zhening Huang, et al.

Current 3D open-vocabulary scene understanding methods mostly use well-aligned 2D images as the bridge for learning 3D features with language. However, these approaches are hard to apply in scenarios where 2D images are absent. In this work, we introduce a new pipeline, OpenIns3D, which requires no 2D image inputs, for 3D open-vocabulary scene understanding at the instance level. The OpenIns3D framework employs a "Mask-Snap-Lookup" scheme. The "Mask" module learns class-agnostic mask proposals in 3D point clouds. The "Snap" module generates synthetic scene-level images at multiple scales and leverages 2D vision-language models to extract objects of interest. The "Lookup" module searches through the outcomes of "Snap" with the help of Mask2Pixel maps, which contain the precise correspondence between 3D masks and synthetic images, to assign category names to the proposed masks. This 2D input-free, easy-to-train, and flexible approach achieves state-of-the-art results on a wide range of indoor and outdoor datasets by a large margin. Furthermore, OpenIns3D allows the 2D detector to be switched without re-training. When integrated with state-of-the-art 2D open-world models such as ODISE and GroundingDINO, it delivers strong results on open-vocabulary instance segmentation. When integrated with LLM-powered 2D models like LISA, it demonstrates a remarkable capacity to process highly complex text queries, including those that require intricate reasoning and world knowledge. Project page:
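To make the "Lookup" idea concrete, here is a minimal sketch of how a Mask2Pixel-style map could be combined with 2D detections to name 3D masks. It is an illustration under stated assumptions, not the authors' implementation: the function name `lookup_labels`, the box-based detection format, the `min_overlap` threshold, and the score-weighted voting rule are all hypothetical. Each synthetic image is assumed to come with an integer map whose pixel values are 3D mask ids (-1 for background).

```python
from collections import defaultdict

import numpy as np


def lookup_labels(mask2pixel_maps, detections, num_masks, min_overlap=0.5):
    """Assign a category name to each 3D mask by voting over synthetic images.

    mask2pixel_maps: list of (H, W) int arrays; pixel value = 3D mask id, -1 = background.
    detections: one list per image of (label, score, (x0, y0, x1, y1)) boxes.
    Returns a dict mapping mask id -> label for masks that received a vote.
    All of this is an assumed, simplified format for illustration only.
    """
    votes = [defaultdict(float) for _ in range(num_masks)]
    for id_map, dets in zip(mask2pixel_maps, detections):
        for mask_id in range(num_masks):
            mask_pixels = id_map == mask_id
            area = int(mask_pixels.sum())
            if area == 0:
                continue  # this mask is not visible in this image
            for label, score, (x0, y0, x1, y1) in dets:
                # Fraction of the mask's projected pixels that fall inside the box.
                inside = int(mask_pixels[y0:y1, x0:x1].sum())
                fraction = inside / area
                if fraction >= min_overlap:
                    votes[mask_id][label] += score * fraction
    return {m: max(v, key=v.get) for m, v in enumerate(votes) if v}
```

Because the voting only consumes labels, scores, and boxes, swapping the 2D detector in this sketch changes the `detections` input and nothing else, which mirrors the abstract's point that detectors can be exchanged without re-training.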




OpenMask3D: Open-Vocabulary 3D Instance Segmentation

We introduce the task of open-vocabulary 3D instance segmentation. Tradi...

Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations

Existing instance segmentation models learn task-specific information us...

Open-Vocabulary Panoptic Segmentation with MaskCLIP

In this paper, we tackle a new computer vision task, open-vocabulary pan...

Instance Neural Radiance Field

This paper presents one of the first learning-based NeRF 3D instance seg...

LERF: Language Embedded Radiance Fields

Humans describe the physical world using natural language to refer to sp...

Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models

We study open-world 3D scene understanding, a family of tasks that requi...

Learning Segmentation Masks with the Independence Prior

An instance with a bad mask might make a composite image that uses it lo...
