GPTR: Gestalt-Perception Transformer for Diagram Object Detection

by Xin Hu, et al.

Diagram object detection is a key foundation for practical applications such as textbook question answering. Because diagrams consist mainly of simple lines and color blocks, their visual features are sparser than those of natural images. In addition, diagrams usually express diverse knowledge, so they contain many low-frequency object categories. As a result, traditional data-driven detection models are not well suited to diagrams. In this work, we propose a gestalt-perception transformer model (GPTR) for diagram object detection, which is based on an encoder-decoder architecture. Gestalt perception comprises a series of laws that explain human perception: the human visual system tends to perceive patches in an image that are similar, close, or connected without abrupt directional changes as a single perceptual whole. Inspired by these laws, we build a gestalt-perception graph in the transformer encoder, with diagram patches as nodes and the relationships between patches as edges. This graph aims to group the patches into objects via the laws of similarity, proximity, and smoothness encoded in these edges, so that meaningful objects can be detected effectively. Experimental results demonstrate that the proposed GPTR achieves the best results on the diagram object detection task. Our model also obtains results comparable to those of competing methods on natural image object detection.
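To make the idea of a patch graph concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a grayscale image stored as a nested list, uses mean intensity as a stand-in for patch similarity and grid distance as a stand-in for proximity, and omits the smoothness law entirely. The function names `build_patch_graph` and `group_patches`, and the thresholds `sim_threshold` and `max_dist`, are hypothetical choices made only for this example.

```python
import math


def build_patch_graph(image, patch_size, sim_threshold=0.9, max_dist=1.5):
    """Build a toy patch graph from a 2D list of grayscale values in [0, 1].

    Nodes are square patches; an edge links two patches that are both
    similar (close mean intensity) and near each other on the patch grid.
    All thresholds are illustrative assumptions, not values from the paper.
    """
    rows, cols = len(image), len(image[0])
    nodes = []
    for r in range(0, rows, patch_size):
        for c in range(0, cols, patch_size):
            pixels = [
                image[i][j]
                for i in range(r, min(r + patch_size, rows))
                for j in range(c, min(c + patch_size, cols))
            ]
            nodes.append({
                "center": (r / patch_size, c / patch_size),  # grid coordinates
                "mean": sum(pixels) / len(pixels),
            })

    edges = {}
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            similarity = 1.0 - abs(nodes[a]["mean"] - nodes[b]["mean"])
            distance = math.dist(nodes[a]["center"], nodes[b]["center"])
            if similarity >= sim_threshold and distance <= max_dist:
                edges[(a, b)] = {"similarity": similarity, "distance": distance}
    return nodes, edges


def group_patches(num_nodes, edges):
    """Group patches into candidate objects as connected components."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in range(num_nodes):
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())
```

In this sketch, a 4x4 image whose left half is dark and right half is bright, split into 2x2 patches, yields two groups of patches, one per color block. The model described in the abstract instead builds the graph inside the transformer encoder; the hard thresholds and connected-components grouping here are only a hand-written approximation of grouping by similarity and proximity.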




A Diagram Is Worth A Dozen Images

Diagrams are common tools for representing complex concepts, relationshi...

RL-CSDia: Representation Learning of Computer Science Diagrams

Recent studies on computer vision mainly focus on natural images that ex...

IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning

Current visual question answering (VQA) tasks mainly consider answering ...

Dynamic Graph Generation Network: Generating Relational Knowledge from Diagrams

In this work, we introduce a new algorithm for analyzing a diagram, whic...

Detecting People in Cubist Art

Although the human visual system is surprisingly robust to extreme disto...

OCR-VQGAN: Taming Text-within-Image Generation

Synthetic image generation has recently experienced significant improvem...

Structured Set Matching Networks for One-Shot Part Labeling

Diagrams often depict complex phenomena and serve as a good test bed for...
