TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation

08/03/2022
by Jun Wang, et al.

Text-VQA aims at answering questions that require understanding the textual cues in an image. Despite the great progress of existing Text-VQA methods, their performance suffers from insufficient human-labeled question-answer (QA) pairs. Moreover, we observe that the scene text is not fully exploited in the existing datasets: only a small portion of the text in each image participates in the annotated QA pairs, which wastes useful information. To address this deficiency, we develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the rich text already available in the scene context of each image. Specifically, we propose TAG, a text-aware visual question-answer generation architecture that learns to produce meaningful and accurate QA samples using a multimodal transformer. The architecture exploits underexplored scene text information and enhances the scene understanding of Text-VQA models by combining the generated QA pairs with the initial training data. Extensive experimental results on two well-known Text-VQA benchmarks (TextVQA and ST-VQA) demonstrate that our proposed TAG effectively enlarges the training data, which improves Text-VQA performance without extra labeling effort. Moreover, our model outperforms state-of-the-art approaches that are pre-trained with extra large-scale data. Code will be made publicly available.
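The abstract describes a multimodal transformer that fuses scene-text (OCR) and visual features to generate QA pairs, which are then merged with the human-labeled training data. Below is a minimal, hypothetical PyTorch sketch of that general idea; the class name `TagQAGenerator`, the feature dimensions, and the `augment_training_set` helper are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch (not the paper's code): fuse visual-region features and
# OCR-token features with a transformer encoder, then predict question tokens.
import torch
import torch.nn as nn


class TagQAGenerator(nn.Module):
    def __init__(self, d_model=512, vocab_size=30522, n_layers=4, n_heads=8):
        super().__init__()
        # Project detected-object and OCR-token features into a shared space.
        self.obj_proj = nn.Linear(2048, d_model)   # e.g. region features from a detector
        self.ocr_proj = nn.Linear(300, d_model)    # e.g. word embeddings of OCR tokens
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)  # per-position vocabulary logits

    def forward(self, obj_feats, ocr_feats):
        # obj_feats: (B, N_obj, 2048), ocr_feats: (B, N_ocr, 300)
        tokens = torch.cat([self.obj_proj(obj_feats), self.ocr_proj(ocr_feats)], dim=1)
        fused = self.encoder(tokens)     # joint scene + scene-text representation
        return self.lm_head(fused)       # logits used to decode question/answer tokens


def augment_training_set(original_qa, generated_qa):
    """Merge human-labeled QA pairs with generated ones (simple concatenation)."""
    return original_qa + generated_qa


if __name__ == "__main__":
    model = TagQAGenerator()
    logits = model(torch.randn(2, 36, 2048), torch.randn(2, 20, 300))
    print(logits.shape)  # torch.Size([2, 56, 30522])
```

In this sketch the generated QA pairs would simply be appended to the original training set before fine-tuning a downstream Text-VQA model, mirroring the data-augmentation role described in the abstract.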


