Tablext: A Combined Neural Network And Heuristic Based Table Extractor

04/22/2021
by   Zach Colter, et al.

A significant portion of the data available today is found within tables. Therefore, automated table extraction is necessary to obtain thorough results when data-mining. Today's popular state-of-the-art methods for table extraction struggle to adequately extract tables that contain machine-readable text and structural data. To make matters worse, many tables, such as those saved as images, have no machine-readable data at all, which makes most extraction methods completely ineffective. To address these issues, a novel, general-format table extraction tool, Tablext, is proposed. This tool combines computer vision techniques and machine learning methods to efficiently and effectively identify and extract data from tables. Tablext begins by using a custom Convolutional Neural Network (CNN) to identify and separate all potential tables. The identification process is optimized by combining the custom CNN with the YOLO object detection network. Then, the high-level structure of each table is identified with computer vision methods. This high-level structural metadata is used by another CNN to identify exact cell locations. As a final step, Optical Character Recognition (OCR) is performed on each individual cell to extract its content without needing machine-readable text. This multi-stage algorithm allows the neural networks to focus on the complex tasks, while image processing methods efficiently complete the simpler ones. As a result, the proposed approach is general-purpose enough to handle a large batch of tables regardless of their internal encodings or layout complexity. It is also accurate enough to outperform competing state-of-the-art table extractors on the ICDAR 2013 table dataset.
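
The stage ordering described above (detect tables, find structure, locate cells, read each cell) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which computer vision methods are used, so a simple projection-profile heuristic for ruled tables stands in for the structure stage, and its grid is used directly in place of the second CNN. The table detector and the OCR step are passed in as callables, and every name here (`ink_runs`, `grid_cells`, `extract_tables`, `detect_tables`, `read_cell`) is hypothetical.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # binary raster: 1 = ink, 0 = background
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def crop(image: Image, box: Box) -> Image:
    """Cut a rectangular region out of a raster."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def ink_runs(profile: List[int], length: int,
             min_fill: float = 0.9) -> List[Tuple[int, int]]:
    """Find runs of rows (or columns) that are almost entirely ink.

    Assumption: a ruling line covers at least `min_fill` of the table's
    width (or height). Returns (start, end) index pairs, end-exclusive.
    """
    runs, start = [], None
    for i, count in enumerate(profile + [0]):  # trailing 0 closes an open run
        dense = count >= min_fill * length
        if dense and start is None:
            start = i
        elif not dense and start is not None:
            runs.append((start, i))
            start = None
    return runs


def grid_cells(table: Image) -> List[Box]:
    """Heuristic structure stage: cells are the gaps between ruling lines."""
    if not table or not table[0]:
        return []
    height, width = len(table), len(table[0])
    rows = ink_runs([sum(r) for r in table], width)        # horizontal rulings
    cols = ink_runs([sum(c) for c in zip(*table)], height)  # vertical rulings
    return [(top[1], left[1], bottom[0], right[0])
            for top, bottom in zip(rows, rows[1:])
            for left, right in zip(cols, cols[1:])]


def extract_tables(page: Image,
                   detect_tables: Callable[[Image], List[Box]],
                   read_cell: Callable[[Image], str]) -> List[List[str]]:
    """Run the stages in order; the detector and OCR are supplied by the caller."""
    results = []
    for table_box in detect_tables(page):       # stage 1: learned detector
        table = crop(page, table_box)
        cells = grid_cells(table)               # stages 2-3: heuristic stand-in
        results.append([read_cell(crop(table, c)) for c in cells])  # stage 4
    return results
```

In practice `detect_tables` would wrap a trained detection model and `read_cell` would wrap an OCR engine. Keeping them as parameters mirrors the division of labor in the abstract: learned models handle detection and recognition, while cheap image processing handles the regular line structure.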


