MVSTER: Epipolar Transformer for Efficient Multi-View Stereo

by Xiaofeng Wang, et al.

Learning-based Multi-View Stereo (MVS) methods warp source images into the reference camera frustum to form 3D volumes, which are fused into a cost volume that is regularized by subsequent networks. The fusing step plays a vital role in bridging 2D semantics and 3D spatial associations. However, previous methods use extra networks to learn 2D information as fusing cues, which underuses 3D spatial correlations and adds computation cost. We therefore present MVSTER, which leverages the proposed epipolar Transformer to learn both 2D semantics and 3D spatial associations efficiently. Specifically, the epipolar Transformer uses a detachable monocular depth estimator to enhance 2D semantics and uses cross-attention to construct data-dependent 3D associations along the epipolar line. Additionally, MVSTER is built in a cascade structure, where entropy-regularized optimal transport is leveraged to propagate finer depth estimations in each stage. Extensive experiments show that MVSTER achieves state-of-the-art reconstruction performance with significantly higher efficiency: compared with MVSNet and CasMVSNet, MVSTER achieves 34% and 14% relative improvements on the DTU benchmark, with 80% and 51% relative reductions in running time. MVSTER also ranks first on Tanks&Temples-Advanced among all published works. Code is released at
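As a rough illustration of the entropy-regularized optimal transport mentioned above, the sketch below runs plain Sinkhorn iterations between two discrete distributions over depth hypotheses. It is not the authors' implementation: the function name `sinkhorn`, the absolute-difference cost, the value of `eps`, and the toy distributions are all assumptions chosen for the example.

```python
import numpy as np

def sinkhorn(a, b, cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT between histograms a and b (illustrative sketch).

    Returns the transport plan and the transport cost <plan, cost>.
    """
    K = np.exp(-cost / eps)              # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                  # rescale rows toward marginal a
        v = b / (K.T @ u)                # rescale columns toward marginal b
    plan = u[:, None] * K * v[None, :]
    return plan, float(np.sum(plan * cost))

def bump(depths, center, sigma=0.15):
    """Normalized Gaussian-shaped distribution over the depth hypotheses."""
    w = np.exp(-0.5 * ((depths - center) / sigma) ** 2)
    return w / w.sum()

# Hypothetical setup: 8 normalized depth hypotheses along one pixel's ray.
depths = np.linspace(0.0, 1.0, 8)
cost = np.abs(depths[:, None] - depths[None, :])   # assumed ground cost

pred = bump(depths, depths[1])      # predicted depth distribution (toy)
target = bump(depths, depths[6])    # target depth distribution (toy)

plan, dist_far = sinkhorn(pred, target, cost)
_, dist_same = sinkhorn(pred, pred, cost)
print(dist_far, dist_same)
```

Unlike a per-bin cross-entropy, the transport cost grows with how far the predicted mass sits from the target along the depth axis, so a prediction that is off by several hypotheses is penalized more than one that is off by a single hypothesis.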




RayMVSNet: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo

Learning-based multi-view stereo (MVS) has by far centered around 3D con...

RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo

In this paper, we present a learning-based approach for multi-view stere...

Crafting Monocular Cues and Velocity Guidance for Self-Supervised Multi-Frame Depth Learning

Self-supervised monocular methods can efficiently learn depth informatio...

POEM: Reconstructing Hand in a Point Embedded Multi-view Stereo

Enable neural networks to capture 3D geometrical-aware features is essen...

Generalized Binary Search Network for Highly-Efficient Multi-View Stereo

Multi-view Stereo (MVS) with known camera parameters is essentially a 1D...

Curvature-guided dynamic scale networks for Multi-view Stereo

Multi-view stereo (MVS) is a crucial task for precise 3D reconstruction....

DSGN++: Exploiting Visual-Spatial Relation for Stereo-based 3D Detectors

Camera-based 3D object detectors are welcome due to their wider deployme...

