Bounds and Estimates on the Average Edit Distance

11/13/2022
by Gianfranco Bilardi, et al.

The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let e_k(n) denote the average edit distance between random, independent strings of n characters from an alphabet of size k. For k ≥ 2, it is an open problem how to efficiently compute the exact value of α_k(n) = e_k(n)/n, as well as of α_k = lim_{n→∞} α_k(n), a limit known to exist. This paper shows that α_k(n) - Q(n) ≤ α_k ≤ α_k(n), for a specific Q(n) = Θ(√(log n / n)), a result which implies that α_k is computable. The exact computation of α_k(n) is explored, leading to an algorithm running in time T = 𝒪(n^2 k min(3^n, k^n)), a complexity that makes it of limited practical use. An analysis of statistical estimates is proposed, based on McDiarmid's inequality, showing how α_k(n) can be evaluated with good accuracy, a high confidence level, and reasonable computation time, for values of n up to, say, a quarter million. Correspondingly, 99.9% confidence intervals of width approximately 10^-2 are obtained for α_k. Combinatorial arguments on edit scripts are exploited to analytically characterize an efficiently computable lower bound β_k^* to α_k, such that lim_{k→∞} β_k^* = 1. In general, β_k^* ≤ α_k ≤ 1 - 1/k; for k greater than a few dozen, computing β_k^* is much faster than generating good statistical estimates with confidence intervals of width 1 - 1/k - β_k^*. The techniques developed in the paper yield improvements on most previously published numerical values, as well as results for alphabet sizes and string lengths not reported before.
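The statistical approach mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's procedure; it only assumes the standard Levenshtein dynamic program and a textbook application of McDiarmid's inequality. Changing one character of either string changes the edit distance by at most 1, so the mean of N sampled values of d/n is a function of 2nN independent characters, each with influence at most 1/(nN). This gives P(|mean - α_k(n)| ≥ ε) ≤ 2 exp(-ε² n N), hence a half-width ε = √(ln(2/δ) / (nN)) at confidence 1 - δ. The function names `edit_distance` and `estimate_alpha`, and the parameters `samples`, `delta`, and `seed`, are hypothetical and chosen for illustration.

```python
import math
import random


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def estimate_alpha(n, k, samples, delta=0.001, seed=0):
    """Monte Carlo estimate of alpha_k(n) with a McDiarmid-style half-width.

    Assumption (not taken from the paper): the half-width is the plain
    bounded-differences bound sqrt(ln(2/delta) / (n * samples)).
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        total += edit_distance(x, y)
    mean = total / (samples * n)
    half_width = math.sqrt(math.log(2 / delta) / (n * samples))
    return mean, half_width
```

A call such as `estimate_alpha(200, 4, 100)` returns an estimate of α_4(200) together with its half-width at 99.9% confidence. The interval covers α_k(n), not the limit α_k; bracketing α_k additionally requires the relation α_k(n) - Q(n) ≤ α_k ≤ α_k(n) stated above. The quadratic-time dynamic program is the bottleneck, so this sketch is only practical for modest n.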


research
05/11/2023

Optimal Algorithms for Bounded Weighted Edit Distance

The edit distance of two strings is the minimum number of insertions, de...
research
08/20/2021

Does Preprocessing help in Fast Sequence Comparisons?

We study edit distance computation with preprocessing: the preprocessing...
research
04/10/2019

Reducing approximate Longest Common Subsequence to approximate Edit Distance

Given a pair of strings, the problems of computing their Longest Common ...
research
11/24/2021

Gap Edit Distance via Non-Adaptive Queries: Simple and Optimal

We study the problem of approximating edit distance in sublinear time. T...
research
05/17/2020

Towards Efficient Interactive Computation of Dynamic Time Warping Distance

The dynamic time warping (DTW) is a widely-used method that allows us to...
research
11/25/2019

Faster Privacy-Preserving Computation of Edit Distance with Moves

We consider an efficient two-party protocol for securely computing the s...
research
10/25/2020

An Improved Sketching Bound for Edit Distance

We provide improved upper bounds for the simultaneous sketching complexi...
