Multi-View Deep Learning for Consistent Semantic Mapping with RGB-D Cameras

by Lingni Ma, et al.

Visual scene understanding is an important capability that enables robots to act purposefully in their environment. In this paper, we propose a novel approach to object-class segmentation from multiple RGB-D views using deep learning. We train a deep neural network in a semi-supervised way to predict object-class semantics that are consistent across several viewpoints. At test time, the semantic predictions of our network can be fused more consistently in semantic keyframe maps than predictions of a network trained on individual views. We base our network architecture on a recent single-view deep learning approach to RGB and depth fusion for semantic object-class segmentation and enhance it with multi-scale loss minimization. We obtain the camera trajectory using RGB-D SLAM and warp the predictions of RGB-D images into ground-truth annotated frames in order to enforce multi-view consistency during training. At test time, predictions from multiple views are fused into keyframes. We propose and analyze several methods for enforcing multi-view consistency during training and testing. We evaluate the benefit of multi-view consistency training and demonstrate that pooling of deep features and fusion over multiple views outperform single-view baselines on the NYUDv2 benchmark for semantic segmentation. Our end-to-end trained network achieves state-of-the-art performance on the NYUDv2 dataset in single-view segmentation as well as multi-view semantic fusion.
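The test-time fusion step mentioned in the abstract can be illustrated with a small sketch. It assumes the per-pixel class probabilities of each view have already been warped into the keyframe (for example using the SLAM trajectory and depth), and it fuses them with a product of per-view probabilities, which treats the views as independent. The abstract does not say which fusion rule the authors use, so this rule is an assumption, and the names `fuse_bayesian` and `argmax_labels` are hypothetical.

```python
from math import exp, log


def fuse_bayesian(prob_maps, valid_masks, eps=1e-8):
    """Fuse class probabilities from several views warped into one keyframe.

    prob_maps:   list of views, each an H x W x C nested list of probabilities.
    valid_masks: list of views, each an H x W nested list of bools that is
                 True where the warp produced a valid correspondence.
    Returns an H x W x C nested list; pixels seen by no view stay uniform.
    """
    height = len(prob_maps[0])
    width = len(prob_maps[0][0])
    num_classes = len(prob_maps[0][0][0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            # Accumulate log-probabilities (assumed independence of views).
            log_sum = [0.0] * num_classes
            for probs, mask in zip(prob_maps, valid_masks):
                if mask[y][x]:
                    for c in range(num_classes):
                        log_sum[c] += log(probs[y][x][c] + eps)
            # Normalize in a numerically stable way.
            peak = max(log_sum)
            unnorm = [exp(v - peak) for v in log_sum]
            total = sum(unnorm)
            row.append([u / total for u in unnorm])
        fused.append(row)
    return fused


def argmax_labels(fused):
    """Pick the most probable class index for every keyframe pixel."""
    return [[max(range(len(p)), key=p.__getitem__) for p in row] for row in fused]


# Toy example: a 1 x 2 keyframe, 2 classes, 2 views.
view_a = [[[0.6, 0.4], [0.9, 0.1]]]
view_b = [[[0.2, 0.8], [0.5, 0.5]]]
mask_a = [[True, True]]
mask_b = [[True, False]]  # second pixel is not visible in view B

fused = fuse_bayesian([view_a, view_b], [mask_a, mask_b])
labels = argmax_labels(fused)
```

In the toy example, the first pixel is decided by both views, while the second pixel falls back to the single view that observes it. Averaging the probabilities instead of multiplying them would be an equally simple alternative under the same assumptions.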



Multi-view Semantic Consistency based Information Bottleneck for Clustering

Multi-view clustering can make use of multi-source information for unsup...

MVMO: A Multi-Object Dataset for Wide Baseline Multi-View Semantic Segmentation

We present MVMO (Multi-View, Multi-Object dataset): a synthetic dataset ...

Deep Multi-View Learning using Neuron-Wise Correlation-Maximizing Regularizers

Many machine learning problems concern with discovering or associating c...

SilNet : Single- and Multi-View Reconstruction by Learning from Silhouettes

The objective of this paper is 3D shape understanding from single and mu...

Semantic Labeling of Large-Area Geographic Regions Using Multi-View and Multi-Date Satellite Images, and Noisy OSM Training Labels

We present a novel multi-view training framework and CNN architecture fo...

3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation

We present 3DMV, a novel method for 3D semantic scene segmentation of RG...

Multi-view Tracking Using Weakly Supervised Human Motion Prediction

Multi-view approaches to people-tracking have the potential to better ha...
