Subset Sampling and Its Extensions

07/21/2023


This paper studies the subset sampling problem. The input is a set 𝒮 of n records together with a function p that assigns each record v∈𝒮 a probability p(v). A query returns a random subset X of 𝒮, where each record v∈𝒮 is sampled into X independently with probability p(v). The goal is to store 𝒮 in a data structure that answers queries efficiently. If 𝒮 fits in memory, the problem is interesting when 𝒮 is dynamic. We develop a dynamic data structure with 𝒪(1+μ_𝒮) expected query time, 𝒪(n) space, and 𝒪(1) amortized expected time for each update, insertion and deletion, where μ_𝒮 = ∑_{v∈𝒮} p(v) is the expected size of X. The query time and space are optimal. If 𝒮 does not fit in memory, the problem is difficult even if 𝒮 is static. In this scenario, we present an I/O-efficient algorithm that answers a query in 𝒪((log^*_B n)/B + (μ_𝒮/B) log_{M/B}(n/B)) amortized expected I/Os using 𝒪(n/B) space, where M is the memory size, B is the block size, and log^*_B n is the number of times log_2(·) must be applied iteratively to n before the result drops below B. In addition, when each record is associated with a real-valued key, we extend the subset sampling problem to the range subset sampling problem, in which the keys of the sampled records must fall within a specified input range [a,b]. For this extension, we provide a solution in the dynamic setting with 𝒪(log n + μ_{𝒮∩[a,b]}) expected query time, 𝒪(n) space, and 𝒪(log n) amortized expected time for each update, insertion and deletion.
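
To make the problem statement concrete, the following Python sketch spells out the query semantics with a naive baseline that flips one coin per record, so every query costs time linear in n regardless of μ_𝒮. It is not the paper's data structure; it only illustrates what a query and a range query must return, and one reading of the definitions of μ_𝒮 and log^*_B n. All function names, the dictionary-based representation of 𝒮, and the assumption B ≥ 2 are hypothetical choices made for illustration.

```python
import math
import random


def expected_size(probs):
    """mu_S: the sum of p(v) over all records, i.e. the expected size of X."""
    return sum(probs.values())


def naive_subset_sample(probs, rng=random):
    """Naive baseline (not the paper's structure): one independent coin
    flip per record, so a query takes O(n) time even when mu_S is tiny."""
    return {v for v, p in probs.items() if rng.random() < p}


def naive_range_subset_sample(keys, probs, a, b, rng=random):
    """Naive range variant: only records whose key lies in [a, b] are
    eligible; each eligible record is kept with probability p(v)."""
    return {
        v
        for v, p in probs.items()
        if a <= keys[v] <= b and rng.random() < p
    }


def log_star_B(n, B):
    """Number of times log2 is applied to n before the value drops
    below B. Assumes B >= 2 so that the loop terminates."""
    count = 0
    x = float(n)
    while x >= B:
        x = math.log2(x)
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical toy input: record ids mapped to probabilities and keys.
    probs = {"r1": 0.9, "r2": 0.1, "r3": 0.5}
    keys = {"r1": 2.0, "r2": 7.5, "r3": 4.0}
    print(expected_size(probs))
    print(naive_subset_sample(probs))
    print(naive_range_subset_sample(keys, probs, 3.0, 8.0))
    print(log_star_B(65536, 4))
```

The gap between this baseline's linear query cost and the 𝒪(1+μ_𝒮) bound stated in the abstract is largest when most probabilities are small, since the expected output size μ_𝒮 can then be far below n.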
