Subset Sampling and Its Extensions
This paper studies the subset sampling problem. The input is a set 𝒮 of n records together with a function p that assigns each record v ∈ 𝒮 a probability p(v). A query returns a random subset X of 𝒮, where each record v ∈ 𝒮 is sampled into X independently with probability p(v). The goal is to store 𝒮 in a data structure to answer queries efficiently. If 𝒮 fits in memory, the problem is interesting when 𝒮 is dynamic. We develop a dynamic data structure with 𝒪(1+μ_𝒮) expected query time, 𝒪(n) space and 𝒪(1) amortized expected update, insert and delete time, where μ_𝒮 = ∑_{v∈𝒮} p(v). The query time and space are optimal. If 𝒮 does not fit in memory, the problem is difficult even if 𝒮 is static. Under this scenario, we present an I/O-efficient algorithm that answers a query in 𝒪((log^*_B n)/B + (μ_𝒮/B) log_{M/B}(n/B)) amortized expected I/Os using 𝒪(n/B) space, where M is the memory size, B is the block size and log^*_B n is the number of iterative log_2(.) operations we need to perform on n before going below B. In addition, when each record is associated with a real-valued key, we extend the subset sampling problem to the range subset sampling problem, in which we require that the keys of the sampled records fall within a specified input range [a,b]. For this extension, we provide a solution under the dynamic setting, with 𝒪(log n + μ_{𝒮∩[a,b]}) expected query time, 𝒪(n) space and 𝒪(log n) amortized expected update, insert and delete time.
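To make the query semantics concrete, the following is a minimal Python sketch, not the data structure developed in the paper. It shows (i) the naive query, which flips one coin per record and therefore costs time linear in n regardless of the output size, and (ii) a well-known bucketing idea with geometric skips, in which records with similar probabilities are grouped so that a query touches roughly one record per sampled element plus a constant per non-empty bucket. All names (`naive_query`, `build_buckets`, `bucketed_query`, `expected_size`) are hypothetical, and the sketch is static: it assumes the records are given as a dict from record to probability and does not support updates.

```python
import math
import random
from collections import defaultdict


def expected_size(probs):
    """mu_S: the sum of p(v) over all records, i.e. the expected output size."""
    return sum(probs.values())


def naive_query(probs, rng):
    """Baseline: one independent coin flip per record, linear time per query."""
    return [v for v, p in probs.items() if rng.random() < p]


def build_buckets(probs):
    """Group records so that bucket e holds those with 2**(e-1) <= p <= 2**e.

    Assumption of this sketch: records with p(v) == 0 are simply dropped.
    """
    buckets = defaultdict(list)
    for v, p in probs.items():
        if p <= 0.0:
            continue
        _, e = math.frexp(p)             # p = m * 2**e with 0.5 <= m < 1
        buckets[min(e, 0)].append((v, p))  # cap so the bucket bound q is <= 1
    return buckets


def bucketed_query(buckets, rng):
    """Sample each bucket at its upper bound q via geometric skips, then
    accept each candidate with probability p / q (always at least 1/2)."""
    out = []
    for e, items in buckets.items():
        q = 2.0 ** e
        i = -1
        while True:
            if q >= 1.0:
                i += 1                   # every position is a candidate
            else:
                # Number of failures before the next success of a q-coin.
                skip = math.log1p(-rng.random()) / math.log1p(-q)
                if skip >= len(items):   # also guards against an infinite skip
                    break
                i += int(skip) + 1
            if i >= len(items):
                break
            v, p = items[i]
            if rng.random() < p / q:
                out.append(v)
    return out


if __name__ == "__main__":
    rng = random.Random(2024)
    probs = {"a": 0.5, "b": 0.25, "c": 0.25}
    print(expected_size(probs))
    print(naive_query(probs, rng))
    print(bucketed_query(build_buckets(probs), rng))
```

Each record is a candidate with probability q and is then accepted with probability p(v)/q, so it enters the output with probability exactly p(v), independently of the others. The sketch still visits every non-empty bucket on each query, so it does not reach the 𝒪(1+μ_𝒮) bound stated above.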