Beyond Image to Depth: Improving Depth Prediction using Echoes

by Kranti Kumar Parida, et al.

We address the problem of estimating depth from multimodal audio-visual data. Inspired by the ability of animals such as bats and dolphins to infer the distance of objects through echolocation, some recent methods have used echoes for depth estimation. We propose an end-to-end deep learning pipeline that uses RGB images, binaural echoes, and estimated material properties of the objects within a scene. We argue that the relation between image, echoes, and depth for different scene elements is strongly influenced by the properties of those elements, and that a method designed to leverage this information can significantly improve depth estimation from audio-visual inputs. We propose a novel multimodal fusion technique that explicitly incorporates the material properties while combining the audio (echo) and visual modalities to predict scene depth. We show empirically, with experiments on the Replica dataset, that the proposed method obtains a 28% improvement in RMSE over the state-of-the-art audio-visual depth prediction method. To demonstrate the effectiveness of our method on a larger dataset, we report competitive performance on Matterport3D, proposing for the first time its use as a multimodal depth prediction benchmark with echoes. We also analyse the proposed method with exhaustive ablation experiments and qualitative results. The code and models are available at
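As a reading aid for the RMSE figure quoted in the abstract, the following is a minimal sketch of how root-mean-square error between predicted and ground-truth depth, and the relative improvement of one method over another, are commonly computed. The function names `rmse` and `relative_improvement` and the sample depth values are illustrative assumptions, not taken from the paper or its released code.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences of depths."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for p, t in zip(pred, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical per-pixel depths in metres, for illustration only.
ground_truth = [1.0, 2.0, 3.0, 4.0]
baseline_pred = [1.5, 2.5, 2.0, 4.5]
improved_pred = [1.2, 2.1, 2.7, 4.2]

baseline_rmse = rmse(baseline_pred, ground_truth)
improved_rmse = rmse(improved_pred, ground_truth)
gain = relative_improvement(baseline_rmse, improved_rmse)
```

A lower RMSE is better, so a positive `gain` means the second prediction is closer to the ground truth than the baseline.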



