Fast, Small, and Simple Document Listing on Repetitive Text Collections

02/20/2019
by Dustin Cobas, et al.

Document listing on string collections is the task of finding all documents where a pattern appears. It is regarded as the most fundamental document retrieval problem and is useful in various applications. Many of the fastest-growing string collections are composed of very similar documents, such as versioned code and document collections or genome repositories. Plain pattern-matching indexes designed for repetitive text collections achieve orders-of-magnitude reductions in space. In contrast, there are few analogous indexes for document retrieval. In this paper we present a simple document listing index for repetitive string collections of total length n that lists the ndoc distinct documents where a pattern of length m appears in time O(m + ndoc · lg n). We exploit the repetitiveness of the document array (i.e., the suffix array coarsened to document identifiers) to grammar-compress it while precomputing the answers to nonterminals, which are stored in grammar-compressed form as well. Our experimental results show that our index sharply outperforms existing alternatives in the space/time tradeoff map.
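To make the problem statement and the document array concrete, the following is a minimal, uncompressed sketch in Python. It is not the grammar-compressed index described in the abstract: it builds a naive suffix array over the concatenated collection, maps each suffix to its document identifier, and lists the distinct documents in the pattern's suffix array range. The function names `build_document_array` and `list_documents`, and the `\x00` separator (assumed absent from documents and patterns), are hypothetical choices for illustration.

```python
from bisect import bisect_left, bisect_right


def build_document_array(docs):
    """Return (text, sa, da) for a list of document strings.

    Assumption: the separator "\x00" occurs in no document.
    """
    text = ""
    starts = []  # starting position of each document in the concatenation
    for d in docs:
        starts.append(len(text))
        text += d + "\x00"
    # Naive suffix array: sort suffix start positions lexicographically.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Document array: the suffix array coarsened to document identifiers.
    da = [bisect_right(starts, pos) - 1 for pos in sa]
    return text, sa, da


def list_documents(text, sa, da, pattern):
    """List the distinct documents in which `pattern` occurs."""
    m = len(pattern)
    # Length-m prefixes of the sorted suffixes are themselves sorted, so the
    # suffixes starting with the pattern form one contiguous range [lo, hi).
    prefixes = [text[i:i + m] for i in sa]  # materialized only for clarity
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    # Naive step: scan the whole range and deduplicate document identifiers.
    return sorted(set(da[lo:hi]))


docs = ["abracadabra", "cadabra", "banana"]
text, sa, da = build_document_array(docs)
print(list_documents(text, sa, da, "bra"))  # → [0, 1]
```

The final scan costs time proportional to the number of occurrences of the pattern, which can far exceed ndoc. Avoiding that gap, while keeping the structure small on repetitive collections, is what the index in the abstract targets.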

research
10/14/2020

Contextual Pattern Matching

The research on indexing repetitive string collections has focused on th...
research
04/29/2019

Semantic Matching of Documents from Heterogeneous Collections: A Simple and Transparent Method for Practical Applications

We present a very simple, unsupervised method for the pairwise matching ...
research
04/06/2020

Indexing Highly Repetitive String Collections

Two decades ago, a breakthrough in indexing string collections made it p...
research
06/10/2020

Tailoring r-index for metagenomics

A basic problem in metagenomics is to assign a sequenced read to the cor...
research
12/26/2019

On the Reproducibility of Experiments of Indexing Repetitive Document Collections

This work introduces a companion reproducible paper with the aim of allo...
research
11/11/2022

Efficient Immediate-Access Dynamic Indexing

In a dynamic retrieval system, documents must be ingested as they arrive...
research
11/13/2020

A grammar compressor for collections of reads with applications to the construction of the BWT

We describe a grammar for DNA sequencing reads from which we can compute...