Compressed Multiple Pattern Matching

11/03/2018
by Dmitry Kosolobov, et al.

Given d strings over the alphabet {0,1,...,σ-1}, the classical Aho-Corasick data structure allows us to find all occ occurrences of the strings in any text T in O(|T| + occ) time using O(m log m) bits of space, where m is the number of edges in the trie containing the strings. Fix any constant ε ∈ (0, 2). We describe a compressed solution for the problem that, provided σ < m^δ for a constant δ < 1, works in O(|T| (1/ε) log(1/ε) + occ) time, which is O(|T| + occ) since ε is constant, and occupies mH_k + 1.443m + εm + O(d log(m/d)) bits of space, for all 0 ≤ k ≤ max{0, α log_σ m - 2} simultaneously, where α ∈ (0,1) is an arbitrary constant and H_k is the kth-order empirical entropy of the trie. Hence, we reduce the 3.443m term in the space bounds of the previously best succinct solutions to (1.443 + ε)m, thus solving an open problem posed by Belazzougui. Further, we notice that L = log binom(σ(m+1), m) - O(log(σm)) is a worst-case space lower bound for any solution of the problem and that, for d = o(m) and constant ε, our approach achieves L + εm bits of space. This gives evidence that, for d = o(m), the space of our data structure is theoretically optimal up to the εm additive term and that the 1.443m term can hardly be eliminated. In addition, we refine the space analysis of previous works by proposing a more appropriate definition of H_k. We also simplify the construction for practical use by adapting the fixed block compression boosting technique, then implement our data structure and conduct a number of experiments showing that it is comparable to the state of the art in terms of time and superior in space.
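For orientation, the uncompressed baseline can be sketched as follows. This is a minimal, pointer-based Aho-Corasick automaton in Python, not the compressed structure of the paper. The names `build_automaton`, `find_all`, and `lower_bound_leading_term` are hypothetical. The last helper assumes that the leading term of the lower bound L is the base-2 logarithm of the binomial coefficient binom(σ(m+1), m), which is a reading of the formula rather than something the abstract spells out.

```python
from collections import deque
from math import comb, log2


def build_automaton(patterns):
    """Build the trie of the patterns with failure links (classical, uncompressed)."""
    goto = [{}]   # goto[v][c] is the child of node v by letter c
    out = [[]]    # out[v] lists indices of patterns that end at node v
    for idx, p in enumerate(patterns):
        v = 0
        for c in p:
            if c not in goto[v]:
                goto.append({})
                out.append([])
                goto[v][c] = len(goto) - 1
            v = goto[v][c]
        out[v].append(idx)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 nodes fail to the root
    while queue:                      # breadth-first, so shallower nodes are done first
        v = queue.popleft()
        for c, u in goto[v].items():
            f = fail[v]
            while f and c not in goto[f]:
                f = fail[f]
            fail[u] = goto[f][c] if c in goto[f] else 0
            out[u] = out[u] + out[fail[u]]   # inherit matches ending at the suffix node
            queue.append(u)
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_position, pattern_index) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    v, hits = 0, []
    for i, c in enumerate(text):
        while v and c not in goto[v]:
            v = fail[v]
        v = goto[v].get(c, 0)
        for idx in out[v]:
            hits.append((i - len(patterns[idx]) + 1, idx))
    return hits


def lower_bound_leading_term(sigma, m):
    """Assumed leading term of L in bits: log2 of binom(sigma * (m + 1), m)."""
    return log2(comb(sigma * (m + 1), m))


patterns = ["he", "she", "his", "hers"]
goto, fail, out = build_automaton(patterns)
num_edges = len(goto) - 1             # the parameter m of the abstract
hits = find_all("ushers", patterns)
```

Under the stated assumption, `lower_bound_leading_term(sigma, m) / m - log2(sigma)` comes out close to log2(e) ≈ 1.443 for large `sigma` and `m`, which is at least numerically consistent with the 1.443m term in the space bound.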
