Masked Vision-Language Transformers for Scene Text Recognition

11/09/2022
by   Jie Wu, et al.
0

Scene text recognition (STR) enables computers to recognize and read the text in various real-world scenes. Recent STR models benefit from taking linguistic information in addition to visual cues into consideration. We propose a novel Masked Vision-Language Transformers (MVLT) to capture both the explicit and the implicit linguistic information. Our encoder is a Vision Transformer, and our decoder is a multi-modal Transformer. MVLT is trained in two stages: in the first stage, we design a STR-tailored pretraining method based on a masking strategy; in the second stage, we fine-tune our model and adopt an iterative correction method to improve the performance. MVLT attains superior results compared to state-of-the-art STR models on several benchmarks. Our code and model are available at https://github.com/onealwj/MVLT.

READ FULL TEXT

page 4

page 6

page 9

page 10

page 16

page 17

research
06/01/2023

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

Modern hierarchical vision transformers have added several vision-specif...
research
05/18/2021

Vision Transformer for Fast and Efficient Scene Text Recognition

Scene text recognition (STR) enables computers to read text in natural s...
research
07/25/2023

Multi-Granularity Prediction with Learnable Fusion for Scene Text Recognition

Due to the enormous technical challenges and wide range of applications,...
research
08/20/2023

ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer

In recent years, end-to-end scene text spotting approaches are evolving ...
research
05/09/2023

Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition

Vision model have gained increasing attention due to their simplicity an...
research
06/21/2023

ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining

Scene text removal (STR) aims at replacing text strokes in natural scene...
research
02/28/2023

Augmented Transformers with Adaptive n-grams Embedding for Multilingual Scene Text Recognition

While vision transformers have been highly successful in improving the p...

Please sign up or login with your details

Forgot password? Click here to reset