Sequential algorithms for testing identity and closeness of distributions
What advantage do sequential procedures provide over batch algorithms for testing properties of unknown distributions? Focusing on the problem of testing whether two distributions 𝒟_1 and 𝒟_2 on {1,…, n} are equal or ϵ-far, we give several answers to this question. We show that for a small alphabet size n, there is a sequential algorithm that outperforms any batch algorithm by a factor of at least 4 in terms sample complexity. For a general alphabet size n, we give a sequential algorithm that uses no more samples than its batch counterpart, and possibly fewer if the actual distance TV(𝒟_1, 𝒟_2) between 𝒟_1 and 𝒟_2 is larger than ϵ. As a corollary, letting ϵ go to 0, we obtain a sequential algorithm for testing closeness when no a priori bound on TV(𝒟_1, 𝒟_2) is given that has a sample complexity 𝒪̃(n^2/3/TV(𝒟_1, 𝒟_2)^4/3): this improves over the 𝒪̃(n/log n/TV(𝒟_1, 𝒟_2)^2) tester of <cit.> and is optimal up to multiplicative constants. We also establish limitations of sequential algorithms for the problem of testing identity and closeness: they can improve the worst case number of samples by at most a constant factor.
READ FULL TEXT