Constant-delay enumeration algorithms for document spanners over nested documents

10/12/2020
by   Martin Muñoz, et al.

Some of the most relevant document formats used online, such as XML and JSON, are nested. In recent years, the task of extracting data from large nested documents has become especially relevant. We model queries of this kind as Visibly Pushdown Transducers (VPT), a model that extends visibly pushdown automata with outputs. Since processing a string through a VPT can generate a huge number of outputs, we are interested in the task of enumerating them one after another as efficiently as possible. This paper describes an algorithm that enumerates these outputs with output-linear delay after preprocessing the string in a single pass. We show applications of this result to recursive document spanners over nested documents and show how our algorithm can be adapted to enumerate the outputs in this context.
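
To make the notion of "outputs of a VPT" concrete, the following is a minimal Python sketch of a toy nondeterministic visibly pushdown transducer. It is not the algorithm described in the paper: it enumerates outputs by naive backtracking over runs, with no preprocessing pass and no delay guarantee, and it may repeat an output if two different runs produce it. The alphabet (`<`, `>`, `a`), the single state `q`, the stack symbol `X`, the output label `mark`, and all table and function names are assumptions made up for illustration.

```python
from typing import Iterator, Optional, Tuple

# Hypothetical toy alphabet: "<" opens a nesting level, ">" closes it,
# and every other symbol (here "a") is neutral.
OPEN, CLOSE = {"<"}, {">"}

# Transition tables of a made-up VPT (illustrative only).
# open:    (state, symbol)            -> [(next_state, pushed_symbol, output)]
# close:   (state, symbol, stack_top) -> [(next_state, output)]
# neutral: (state, symbol)            -> [(next_state, output)]
# An output of None means "emit nothing at this position".
OPEN_T = {("q", "<"): [("q", "X", None)]}
CLOSE_T = {("q", ">", "X"): [("q", None)]}
NEUTRAL_T = {("q", "a"): [("q", None), ("q", "mark")]}
INITIAL, FINAL = "q", {"q"}

Output = Tuple[Tuple[str, int], ...]


def emit(label: Optional[str], pos: int) -> Output:
    """Return a one-element output (label, position), or nothing for None."""
    return () if label is None else ((label, pos),)


def outputs(doc: str, i: int = 0, state: str = INITIAL,
            stack: Tuple[str, ...] = (), out: Output = ()) -> Iterator[Output]:
    """Naively enumerate the output of every accepting run on `doc`."""
    if i == len(doc):
        # Accept only in a final state with an empty stack (well-nested input).
        if state in FINAL and not stack:
            yield out
        return
    sym = doc[i]
    if sym in OPEN:
        # The input symbol alone decides that this step pushes.
        for nxt, push, label in OPEN_T.get((state, sym), []):
            yield from outputs(doc, i + 1, nxt, stack + (push,),
                               out + emit(label, i))
    elif sym in CLOSE:
        # The input symbol alone decides that this step pops.
        if stack:
            for nxt, label in CLOSE_T.get((state, sym, stack[-1]), []):
                yield from outputs(doc, i + 1, nxt, stack[:-1],
                                   out + emit(label, i))
    else:
        # Neutral symbols leave the stack untouched.
        for nxt, label in NEUTRAL_T.get((state, sym), []):
            yield from outputs(doc, i + 1, nxt, stack, out + emit(label, i))


if __name__ == "__main__":
    for result in outputs("<a>a"):
        print(result)
```

In this toy machine each `a` can be either skipped or marked, so a document with n occurrences of `a` has 2^n outputs. This exponential growth is the reason the abstract frames the problem as enumerating outputs one at a time rather than materializing the whole set.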


research
09/25/2022

Constant-delay enumeration for SLP-compressed documents

We study the problem of enumerating results from a query over a compress...
research
03/14/2018

Constant delay algorithms for regular document spanners

Regular expressions and automata models with capture variables are core ...
research
10/25/2012

Nested Hierarchical Dirichlet Processes

We develop a nested hierarchical Dirichlet process (nHDP) for hierarchic...
research
03/13/2021

Lightweight Selective Disclosure for Verifiable Documents on Blockchain

To achieve lightweight selective disclosure for protecting privacy of do...
research
08/20/2017

Fast Access to Columnar, Hierarchically Nested Data via Code Transformation

Big Data query systems represent data in a columnar format for fast, sel...
research
01/03/2022

Efficient enumeration algorithms for annotated grammars

We introduce annotated grammars, an extension of context-free grammars w...
research
07/24/2018

Constant-Delay Enumeration for Nondeterministic Document Spanners

We consider the information extraction approach known as document spanne...
