Substring Complexity in Sublinear Space
Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad-hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel-Ziv parse or the number r of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size γ of a smallest string attractor. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure based on the function S_T that counts, for each length k, the number of distinct length-k substrings of T, also known as the substring complexity of T. This measure is defined as δ = sup{S_T(k)/k : k ≥ 1}; it lower bounds all the measures previously considered. In particular, δ ≤ γ always holds, and δ can be computed in 𝒪(n) time using Ω(n) working space. Kociumaka et al. showed that if δ is given, one can construct an 𝒪(δ log(n/δ))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T. Given that for highly compressible strings δ is significantly smaller than n, it is natural to pose the following question: Can we compute δ efficiently using sublinear working space? It is straightforward to show that any algorithm computing δ using 𝒪(b) space requires Ω(n^{2-o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We present the following results: an 𝒪(n^3/b^2)-time and 𝒪(b)-space algorithm to compute δ, for any b ∈ [1, n]; and an 𝒪̃(n^2/b)-time and 𝒪(b)-space algorithm to compute δ, for any b ∈ [n^{2/3}, n].
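To make the definition of δ concrete, the following is a minimal, naive sketch (not one of the paper's sublinear-space algorithms): it materializes the set of distinct substrings for each length k, so it uses up to Θ(n^2) working space, whereas the paper's contribution is precisely to avoid this.

```python
def substring_complexity_delta(T: str) -> float:
    """Compute delta = sup{ S_T(k)/k : k >= 1 }, where S_T(k) is the
    number of distinct length-k substrings of T. Naive: Theta(n^2) space."""
    n = len(T)
    delta = 0.0
    for k in range(1, n + 1):
        # S_T(k): collect all distinct substrings of length k.
        distinct_k = {T[i:i + k] for i in range(n - k + 1)}
        delta = max(delta, len(distinct_k) / k)
    return delta
```

For example, for T = "abab" we get S_T(1) = 2 (substrings "a" and "b"), so δ = 2; for T = "aaaa", S_T(k) = 1 for every k, so δ = 1.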