Population Recovery from the Deletion Channel: Nearly Matching Trace Reconstruction Bounds
The population recovery problem asks one to recover an unknown distribution over n-bit strings, given query access to independent noisy samples of strings drawn from the distribution. Recently, Ban et al. [BCF+19] studied the version of this problem in which the unknown distribution over n-bit strings is known to be ℓ-sparse for some fixed ℓ, and the noise is induced by the deletion channel. The deletion channel is a noise model in which each bit of the string is independently deleted with some fixed probability, and the retained bits are concatenated. We note that if ℓ = 1, i.e., we are trying to learn a single string, learning the distribution is equivalent to the famous trace reconstruction problem. The best known algorithms for trace reconstruction require exp(O(n^1/3)) samples. For population recovery under the deletion channel, Ban et al. gave an algorithm that learns ℓ-sparse distributions over strings using exp(n^1/2 · (log n)^O(ℓ)) samples. In this work, we give an algorithm that learns the distribution using only exp(Õ(n^1/3) · ℓ^2) samples, by developing a higher-moment analog of the algorithms of [DOS17, NP17]. We also give the first algorithm with a runtime subexponential in n, which solves population recovery with exp(Õ(n^1/3) · ℓ^3) samples and time. Notably, our dependence on n nearly matches the known upper bound when ℓ = 1, and we reduce the dependence on ℓ from doubly exponential to nearly singly exponential. As a result, we are able to learn the mixture even for much larger values of ℓ. For instance, the algorithm of Ban et al. can only learn a mixture of O(log n / log log n) strings with a subexponential number of queries, whereas we are able to learn a mixture of up to n^o(1) strings with a subexponential number of queries and in subexponential time.
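To make the sampling model concrete, the following is a minimal Python sketch (not from the paper) of drawing noisy samples from an ℓ-sparse mixture through the deletion channel; the specific mixture, deletion probability q, and function names are illustrative assumptions.

import random

# Deletion channel: each bit is independently deleted with probability q,
# and the retained bits are concatenated in order to form the observed trace.
def deletion_channel(s, q, rng):
    return "".join(b for b in s if rng.random() >= q)

# One noisy sample: pick a string from the ℓ-sparse mixture (a list of
# (string, weight) pairs), then pass it through the deletion channel.
def sample_trace(mixture, q, rng):
    strings, weights = zip(*mixture)
    s = rng.choices(strings, weights=weights, k=1)[0]
    return deletion_channel(s, q, rng)

rng = random.Random(0)
# Hypothetical 2-sparse mixture over 8-bit strings, deletion probability q = 0.5.
mixture = [("10110010", 0.7), ("01101100", 0.3)]
print([sample_trace(mixture, 0.5, rng) for _ in range(5)])

A population recovery algorithm only sees such traces and must recover both the support strings and their mixing weights.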