MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering

12/06/2021
by   Fangzhi Xu, et al.

Textbook Question Answering (TQA) is a complex multimodal task that requires inferring answers from long context descriptions and numerous diagrams. Compared with Visual Question Answering (VQA), TQA involves a large amount of uncommon terminology and a wide variety of diagram inputs. This poses new challenges for a language model's ability to represent domain-specific spans, and it pushes multimodal fusion to a more complex level. To tackle these issues, we propose a novel model named MoCA, which incorporates multi-stage domain pretraining and multimodal cross attention for the TQA task. First, we introduce a multi-stage domain pretraining module that conducts unsupervised post-pretraining with a span mask strategy and supervised pre-finetuning. For domain post-pretraining in particular, we propose a heuristic generation algorithm to make use of the terminology corpus. Second, to fully exploit the rich inputs of context and diagrams, we propose cross-guided multimodal attention, which updates the features of the text, the question diagram, and the instructional diagram using a progressive strategy. Further, a dual gating mechanism is adopted to improve the model ensemble. The experimental results show the superiority of our model, which outperforms the state-of-the-art methods by 2.21
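As a rough illustration of the span mask idea mentioned in the abstract, the sketch below masks whole terminology spans rather than individual tokens. It is not the paper's heuristic generation algorithm, which the abstract does not specify; the function names, the `[MASK]` token string, the whitespace tokenization, and the example term list are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # assumed mask token; real tokenizers define their own


def find_term_spans(tokens, terms):
    """Return sorted (start, end) index pairs where a term occurs in tokens.

    `terms` is a hypothetical terminology list; each term is a list of tokens.
    """
    spans = []
    for term in terms:
        n = len(term)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == term:
                spans.append((i, i + n))
    return sorted(spans)


def mask_spans(tokens, spans, mask_prob=1.0, rng=None):
    """Mask every token of each selected span and record the original tokens.

    Returns (masked_tokens, labels); labels hold the original token at masked
    positions and None elsewhere.
    """
    rng = rng or random.Random(0)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for start, end in spans:
        # Decide once per span, so a term is masked as a unit.
        if rng.random() < mask_prob:
            for i in range(start, end):
                labels[i] = tokens[i]
                masked[i] = MASK
    return masked, labels


tokens = "the cell membrane controls what enters the cell".split()
terms = [["cell", "membrane"]]  # illustrative terminology corpus entry
spans = find_term_spans(tokens, terms)
masked, labels = mask_spans(tokens, spans)
print(masked)
```

In this toy example only the two-token term is masked as a unit; the standalone "cell" at the end of the sentence is left intact because it does not match a full term.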

research
03/24/2016

A Diagram Is Worth A Dozen Images

Diagrams are common tools for representing complex concepts, relationshi...
research
04/25/2020

MCQA: Multimodal Co-attention Based Network for Question Answering

We present MCQA, a learning-based algorithm for multimodal question answ...
research
04/08/2019

Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering

In this paper, we propose a novel end-to-end trainable Video Question An...
research
02/25/2019

MUREL: Multimodal Relational Reasoning for Visual Question Answering

Multimodal attentional networks are currently state-of-the-art models fo...
research
11/25/2020

XTQA: Span-Level Explanations of the Textbook Question Answering

Textbook Question Answering (TQA) is a task that one should answer a dia...
research
08/24/2022

FashionVQA: A Domain-Specific Visual Question Answering System

Humans apprehend the world through various sensory modalities, yet langu...
research
03/22/2023

Salient Span Masking for Temporal Understanding

Salient Span Masking (SSM) has shown itself to be an effective strategy ...
