Towards a Definitive Measure of Repetitiveness

10/04/2019
by Tomasz Kociumaka, et al.

Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences, other than the uncomputable Kolmogorov complexity. Since statistical entropy does not capture repetitiveness, ad-hoc measures such as the size z of the Lempel-Ziv parse are frequently used to estimate it. Recently, a more principled measure, the size γ of the smallest attractor of a string S[1..n], was introduced. The measure γ lower-bounds all the previous relevant ones (e.g., z), yet S can be represented and indexed within space O(γ log(n/γ)), which also upper-bounds most measures. While γ is certainly a better measure of repetitiveness, it is NP-complete to compute, and it is not known whether S can always be represented in O(γ) space. In this paper we study a smaller measure, δ ≤ γ, which can be computed in linear time. We show that δ better captures the concept of compressibility in repetitive strings: we prove that, for some string families, it holds that γ = Ω(δ log n). Still, we can build a representation of S of size O(δ log(n/δ)), which supports direct access to any S[i] in time O(log(n/δ)) and finds the occ occurrences of any pattern P[1..m] in time O(m log n + occ log^ϵ n) for any constant ϵ > 0. Further, such a representation is worst-case optimal because, for some string families, S can only be represented in Ω(δ log n) space. We complete our characterization of δ by showing that γ, z, and other measures of repetitiveness are always O(δ log(n/δ)), but in some string families the smallest context-free grammar is of size g = Ω(δ log^2 n / log log n). No such lower bound is known to hold for γ.
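The abstract does not spell out the definition of δ; in the paper it is the substring complexity, δ = max_k d_k(S)/k, where d_k(S) counts the distinct length-k substrings of S. Assuming that definition, here is a minimal, naive sketch of the quantity. It runs in quadratic time or worse and is only illustrative; the paper's linear-time computation relies on suffix-tree machinery, and the function name below is ours, not the authors':

```python
def substring_complexity_delta(s: str) -> float:
    """Naive computation of delta = max over k of d_k(s)/k, where
    d_k(s) is the number of distinct length-k substrings of s.
    Illustrative only: this brute-force version is far from the
    linear-time algorithm described in the paper."""
    n = len(s)
    best = 0.0
    for k in range(1, n + 1):
        # Collect all distinct length-k substrings (k-mers) of s.
        distinct_k = {s[i:i + k] for i in range(n - k + 1)}
        best = max(best, len(distinct_k) / k)
    return best

# A highly repetitive string keeps delta small as it grows:
print(substring_complexity_delta("abab" * 8))   # 2.0 (only "a"/"b", "ab"/"ba", ...)
print(substring_complexity_delta("abcdefgh"))   # 8.0 (d_1 = 8 distinct characters)
```

Note how the repetitive string "abab"·8 has δ = 2 regardless of its length, whereas a string of all-distinct characters has δ equal to its alphabet size.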
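For comparison, z is the number of phrases in the Lempel-Ziv parse mentioned above. A hedged sketch under one common convention (each phrase is the longest prefix of the unparsed suffix that occurs starting at an earlier position, with overlaps allowed, or a single fresh character when none exists); the naive substring search makes this far slower than a real LZ77 factorizer:

```python
def lz_parse_size(s: str) -> int:
    """Size z of a greedy Lempel-Ziv parse: each phrase is the longest
    prefix of the remaining suffix that also occurs starting at an
    earlier position (the occurrence may overlap the phrase), or a
    single fresh character when no such prefix exists."""
    n, i, z = len(s), 0, 0
    while i < n:
        l = 0
        # Grow the phrase while s[i:i+l+1] occurs starting before i;
        # such an occurrence must fit entirely inside s[0:i+l].
        while i + l < n and s.find(s[i:i + l + 1], 0, i + l) != -1:
            l += 1
        i += max(l, 1)  # advance by the phrase, or by one new character
        z += 1
    return z

print(lz_parse_size("abababab"))  # 3 phrases: a | b | ababab
print(lz_parse_size("abcdefgh"))  # 8 phrases: every character is new
```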
