Learning Visual Affordance Grounding from Demonstration Videos

08/12/2021
by Hongchen Luo, et al.

Visual affordance grounding aims to segment all possible interaction regions between people and objects from an image/video, which is beneficial for many applications, such as robot grasping and action recognition. However, existing methods mainly rely on the appearance features of objects to segment each region of the image, and therefore face two problems: (i) there are multiple possible regions of an object that people interact with; and (ii) there are multiple possible human interactions with the same object region. To address these problems, we propose a Hand-aided Affordance Grounding Network (HAG-Net) that leverages the auxiliary cues provided by the position and action of the hand in demonstration videos to resolve these ambiguities and better locate the interaction regions of the object. Specifically, HAG-Net has a dual-branch structure that processes the demonstration video and the object image. For the video branch, we introduce hand-aided attention to enhance the region around the hand in each video frame and then use an LSTM network to aggregate the action features. For the object branch, we introduce a semantic enhancement module (SEM) that makes the network focus on different parts of the object according to the action class, and we apply a distillation loss to align the output features of the object branch with those of the video branch, transferring the knowledge in the video branch to the object branch. Quantitative and qualitative evaluations on two challenging datasets show that our method achieves state-of-the-art results for affordance grounding. The source code will be made available to the public.
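
To make the dual-branch design described above concrete, the following is a minimal PyTorch-style sketch of how a video branch with hand-aided attention and LSTM aggregation, an object branch with action-conditioned semantic enhancement, and a feature-distillation loss could fit together. The module names, feature sizes, hand-mask format, and gating formulation are illustrative assumptions and are not taken from the authors' released implementation.

```python
# A minimal sketch of the dual-branch idea, written in PyTorch.
# All module names, feature sizes, and the attention/enhancement formulations
# are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoBranch(nn.Module):
    """Per-frame features are re-weighted by a hand-centred attention map,
    then aggregated over time with an LSTM."""
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(7))
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frames, hand_maps):
        # frames:    (B, T, 3, H, W)  demonstration video
        # hand_maps: (B, T, 1, 7, 7)  soft masks around the detected hand (assumed given)
        B, T = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1))      # (B*T, C, 7, 7)
        feats = feats * (1.0 + hand_maps.flatten(0, 1))       # hand-aided attention
        feats = feats.mean(dim=(2, 3)).view(B, T, -1)         # (B, T, C)
        _, (h, _) = self.lstm(feats)                          # aggregate action features
        return h[-1]                                          # (B, hidden_dim)

class ObjectBranch(nn.Module):
    """Object features are modulated per action class (semantic enhancement)
    before predicting an affordance heatmap."""
    def __init__(self, feat_dim=512, num_actions=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(7))
        self.action_embed = nn.Embedding(num_actions, feat_dim)  # semantic enhancement
        self.head = nn.Conv2d(feat_dim, 1, 1)

    def forward(self, image, action):
        # image: (B, 3, H, W) object image, action: (B,) action-class indices
        feats = self.encoder(image)                               # (B, C, 7, 7)
        gate = torch.sigmoid(self.action_embed(action))[..., None, None]
        feats = feats * gate                                      # focus per action class
        heatmap = self.head(feats)                                # (B, 1, 7, 7)
        return feats.mean(dim=(2, 3)), heatmap

def distillation_loss(obj_feat, vid_feat):
    # Align the object-branch feature with the (frozen) video-branch feature so
    # that knowledge from the demonstration video transfers to the object branch.
    return F.mse_loss(obj_feat, vid_feat.detach())

# Usage sketch (the grounding loss on the heatmap is the usual task loss,
# not implemented here):
#   vid_feat = video_branch(frames, hand_maps)
#   obj_feat, heatmap = object_branch(image, action_label)
#   loss = grounding_loss(heatmap, target) + distillation_loss(obj_feat, vid_feat)
```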
