Subset Sampling and Its Extensions

07/21/2023
āˆ™
by   Jinchao Huang, et al.
āˆ™
0
āˆ™

This paper studies the subset sampling problem. The input is a set š’® of n records together with a function p that assigns each record vāˆˆš’® a probability p(v). A query returns a random subset X of š’®, where each record vāˆˆš’® is sampled into X independently with probability p(v). The goal is to store š’® in a data structure to answer queries efficiently. If š’® fits in memory, the problem is interesting when š’® is dynamic. We develop a dynamic data structure with š’Ŗ(1+Ī¼_š’®) expected query time, š’Ŗ(n) space and š’Ŗ(1) amortized expected update, insert and delete time, where Ī¼_š’®=āˆ‘_vāˆˆš’®p(v). The query time and space are optimal. If š’® does not fit in memory, the problem is difficult even if š’® is static. Under this scenario, we present an I/O-efficient algorithm that answers a query in š’Ŗ((log^*_B n)/B+(Ī¼_š’®/B)log_M/B (n/B)) amortized expected I/Os using š’Ŗ(n/B) space, where M is the memory size, B is the block size and log^*_B n is the number of iterative log_2(.) operations we need to perform on n before going below B. In addition, when each record is associated with a real-valued key, we extend the subset sampling problem to the range subset sampling problem, in which we require that the keys of the sampled records fall within a specified input range [a,b]. For this extension, we provide a solution under the dynamic setting, with š’Ŗ(log n+Ī¼_š’®āˆ©[a,b]) expected query time, š’Ŗ(n) space and š’Ŗ(log n) amortized expected update, insert and delete time.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset