Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions

07/17/2023
by   Yui Iioka, et al.
0

In this study, we aim to develop a model that comprehends a natural language instruction (e.g., "Go to the living room and get the nearest pillow to the radio art on the wall") and generates a segmentation mask for the target everyday object. The task is challenging because it requires (1) the understanding of the referring expressions for multiple objects in the instruction, (2) the prediction of the target phrase of the sentence among the multiple phrases, and (3) the generation of pixel-wise segmentation masks rather than bounding boxes. Studies have been conducted on languagebased segmentation methods; however, they sometimes mask irrelevant regions for complex sentences. In this paper, we propose the Multimodal Diffusion Segmentation Model (MDSM), which generates a mask in the first stage and refines it in the second stage. We introduce a crossmodal parallel feature extraction mechanism and extend diffusion probabilistic models to handle crossmodal features. To validate our model, we built a new dataset based on the well-known Matterport3D and REVERIE datasets. This dataset consists of instructions with complex referring expressions accompanied by real indoor environmental images that feature various target objects, in addition to pixel-wise segmentation masks. The performance of MDSM surpassed that of the baseline method by a large margin of +10.13 mean IoU.

READ FULL TEXT

page 1

page 2

page 3

page 6

page 7

research
05/29/2023

InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions

Recent works have explored text-guided image editing using diffusion mod...
research
09/17/2018

Mask Editor : an Image Annotation Tool for Image Segmentation Tasks

Deep convolutional neural network (DCNN) is the state-of-the-art method ...
research
02/15/2022

Few-shot semantic segmentation via mask aggregation

Few-shot semantic segmentation aims to recognize novel classes with only...
research
11/05/2019

Recurrent Instance Segmentation using Sequences of Referring Expressions

The goal of this work is to segment the objects in an image that are ref...
research
03/21/2018

Video Object Segmentation with Language Referring Expressions

Most state-of-the-art semi-supervised video object segmentation methods ...
research
07/02/2022

Towards Robust Video Object Segmentation with Adaptive Object Calibration

In the booming video era, video segmentation attracts increasing researc...
research
07/14/2023

Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

This paper describes a domestic service robot (DSR) that fetches everyda...

Please sign up or login with your details

Forgot password? Click here to reset