EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual and Language Learning

09/29/2022
by   Yanmin Wu, et al.

3D visual grounding aims to find the objects within point clouds mentioned by free-form natural language descriptions with rich semantic components. However, existing methods either extract sentence-level features that couple all words, or focus mainly on object names, which would lose the word-level information or neglect other attributes. To alleviate this issue, we present EDA, which Explicitly Decouples the textual attributes in a sentence and conducts Dense Alignment between such fine-grained language and point cloud objects. Specifically, we first propose a text decoupling module to produce textual features for every semantic component. Then, we design two losses to supervise the dense matching between the two modalities: textual position alignment and object semantic alignment. On top of that, we further introduce two new visual grounding tasks, locating objects without object names and locating auxiliary objects referenced in the descriptions, both of which can thoroughly evaluate the model's dense alignment capacity. Through experiments, we achieve state-of-the-art performance on two widely adopted visual grounding datasets, ScanRefer and SR3D/NR3D, and obtain a clear lead on our two newly proposed tasks. The code will be available at https://github.com/yanmin-wu/EDA.
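To make the dense-alignment idea more concrete, here is a minimal sketch of how per-component text features could be matched to object features with a cross-entropy alignment loss. This is an illustrative toy, not the paper's implementation: the feature shapes, the cosine-similarity matching, and the function name `dense_alignment_loss` are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_alignment_loss(text_feats, obj_feats, targets):
    """Toy dense-alignment loss (illustrative, not the paper's code).

    text_feats: (C, D) array, one row per decoupled semantic component.
    obj_feats:  (O, D) array, one row per candidate 3D object.
    targets:    (C,) array, index of the ground-truth object per component.

    Each text component is matched to every object by cosine similarity,
    and cross-entropy pulls each component toward its target object.
    """
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    o = obj_feats / np.linalg.norm(obj_feats, axis=1, keepdims=True)
    sim = t @ o.T                       # (C, O) cosine similarities
    probs = softmax(sim, axis=1)        # per-component match distribution
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))
```

Under this sketch, correct component-object pairings yield a lower loss than mismatched ones, which is the behavior a dense word-to-object supervision signal needs.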


