An extensible point-based method for data chart value detection

08/22/2023
by Carlos Soto et al.

We present an extensible method for identifying semantic points to reverse engineer (i.e., extract the values of) data charts, particularly those in scientific articles. Our method uses a point proposal network (akin to region proposal networks for object detection) to directly predict the positions of points of interest in a chart, and it is readily extensible to multiple chart types and chart elements. We focus on complex bar charts in the scientific literature, on which our model detects salient points with an F1 score of 0.8705 (at a maximum deviation of 1.5 cells); it achieves 0.9810 F1 on synthetically generated charts similar to those used in prior work. We also explore training exclusively on synthetic data with novel augmentations, which reaches surprisingly competent performance (0.6621 F1) on real charts with widely varying appearance, and we further demonstrate our unchanged method applied directly to synthetic pie charts (0.8343 F1). Datasets, trained models, and evaluation code are available at https://github.com/BNLNLP/PPN_model.
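To make the reported metric concrete, the following is a minimal sketch of how an F1 score with a maximum point deviation could be computed. It is not the authors' evaluation code (that is in the linked repository). The function name `point_f1`, the greedy nearest-first matching, the Euclidean distance, and the reading of "cell" as one unit of the prediction grid are all assumptions made for illustration.

```python
import math


def point_f1(pred, truth, max_dev=1.5):
    """F1 for point detection under a distance threshold.

    Hypothetical helper: a predicted point counts as a true positive if it
    can be matched one-to-one with a ground-truth point that lies within
    `max_dev` units (assumed here to be grid cells) of it.
    """
    # Collect every (distance, pred index, truth index) pair inside the
    # threshold, closest pairs first.
    pairs = sorted(
        (math.dist(p, t), i, j)
        for i, p in enumerate(pred)
        for j, t in enumerate(truth)
        if math.dist(p, t) <= max_dev
    )

    # Greedy one-to-one assignment (an assumption; an optimal assignment
    # such as the Hungarian method could be used instead).
    used_pred, used_truth = set(), set()
    for _, i, j in pairs:
        if i not in used_pred and j not in used_truth:
            used_pred.add(i)
            used_truth.add(j)

    tp = len(used_pred)        # matched predictions
    fp = len(pred) - tp        # predictions with no ground-truth partner
    fn = len(truth) - tp       # ground-truth points that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative coordinates only, not data from the paper.
predicted = [(0, 0), (5, 5), (10, 10)]
annotated = [(0, 1), (5, 7), (20, 20)]
score = point_f1(predicted, annotated)
```

In this toy example only the first pair lies within 1.5 units, so one true positive, two false positives, and two false negatives give an F1 of 1/3. Tightening or loosening `max_dev` changes which predictions count as hits, which is why the threshold is stated alongside the score.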

