Approximating Text-to-Pattern Hamming Distances
We revisit a fundamental problem in string matching: given a pattern of length m and a text of length n, both over an alphabet of size σ, compute the Hamming distance between the pattern and the text at every location. Several (1+ϵ)-approximation algorithms have been proposed in the literature, with running time of the form O(ϵ^{-O(1)} n log n log m), all using fast Fourier transform (FFT). We describe a simple (1+ϵ)-approximation algorithm that is faster and does not need FFT. Combining our approach with additional ideas leads to numerous new results:
- We obtain the first linear-time approximation algorithm; the running time is O(ϵ^{-2} n).
- We obtain a faster exact algorithm computing all Hamming distances up to a given threshold k; its running time improves previous results by logarithmic factors and is linear if k < √m.
- We obtain approximation algorithms with better ϵ-dependence using rectangular matrix multiplication. The time-bound is Õ(n) when the pattern is sufficiently long: m > ϵ^{-28}. Previous algorithms require Õ(ϵ^{-1} n) time.
- When k is not too small, we obtain a truly sublinear-time algorithm to find all locations with Hamming distance approximately (up to a constant factor) less than k, in O((n/k^{Ω(1)} + occ) n^{o(1)}) time, where occ is the output size. The algorithm leads to a property tester, returning true if an exact match exists and false if the Hamming distance is more than δm at every location, running in Õ(δ^{-1/3} n^{2/3} + δ^{-1} n/m) time.
- We obtain a streaming algorithm to report all locations with Hamming distance approximately less than k, using Õ(ϵ^{-2} √k) space. Previously, streaming algorithms were known for the exact problem with Õ(k) space or for the approximate problem with Õ(ϵ^{-O(1)} √m) space.
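To make the problem statement concrete, the sketch below computes the exact text-to-pattern Hamming distances by brute force in O(nm) time, and reports the locations whose distance is at most a threshold k. It is only a reference baseline for the quantity being approximated, not any of the algorithms described above; the function names `hamming_distances` and `locations_within` are chosen here for illustration.

```python
def hamming_distances(pattern: str, text: str) -> list[int]:
    """Exact Hamming distance between the pattern and every aligned window of the text.

    Brute-force O(n*m) baseline (illustrative only): entry i is the number of
    positions j where pattern[j] != text[i + j]. Returns an empty list when
    the pattern is longer than the text.
    """
    m, n = len(pattern), len(text)
    return [
        sum(p != t for p, t in zip(pattern, text[i:i + m]))
        for i in range(n - m + 1)
    ]


def locations_within(pattern: str, text: str, k: int) -> list[int]:
    """Alignments whose exact Hamming distance is at most the threshold k."""
    return [i for i, d in enumerate(hamming_distances(pattern, text)) if d <= k]
```

For example, `hamming_distances("aba", "abbaba")` yields one distance per alignment (n - m + 1 = 4 values), and `locations_within("aba", "abbaba", 0)` keeps only the exact matches.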