Weighted Reservoir Sampling from Distributed Streams

04/08/2019
by Rajesh Jayaram, et al.

We consider message-efficient continuous random sampling from a distributed stream, where the probability of including an item in the sample is proportional to a weight associated with the item. The unweighted version, where all weights are equal, is well studied and admits tight upper and lower bounds on message complexity. For weighted sampling with replacement, there is a simple reduction to unweighted sampling with replacement. However, in many applications the stream has only a few heavy items, which may dominate a random sample when chosen with replacement. Weighted sampling without replacement (weighted SWOR) avoids this issue, since each such heavy item can be sampled at most once. In this work, we present the first message-optimal algorithm for weighted SWOR from a distributed stream. Our algorithm also has optimal space and time complexity. As an application of our algorithm for weighted SWOR, we derive the first distributed streaming algorithms for tracking heavy hitters with residual error. Here the goal is to identify stream items that contribute significantly to the residual stream, that is, the stream that remains once the heaviest items are removed. Residual heavy hitters generalize the notion of ℓ_1 heavy hitters and are important in streams that have a skewed distribution of weights. In addition to the upper bound, we also provide a lower bound on the message complexity that is nearly tight, up to a log(1/ϵ) factor. Finally, we use our weighted sampling algorithm to improve the message complexity of distributed L_1 tracking, also known as count tracking, which is a widely studied problem in distributed streaming. We also derive a tight message lower bound, which settles the message complexity of this fundamental problem.
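
To make the notion of weighted SWOR concrete, the following is a minimal single-machine sketch in Python. It is not the distributed, message-optimal algorithm from the paper. It uses the standard exponential-clock technique: each item with weight w draws an independent Exp(1) value divided by w, and the s items with the smallest values form the sample. The function name `weighted_swor` and its parameters are hypothetical and chosen only for illustration.

```python
import heapq
import random


def weighted_swor(stream, s, rng=None):
    """Weighted sample without replacement of size at most s (illustrative sketch).

    `stream` yields (item, weight) pairs with positive weights. Every name
    here is an assumption for illustration, not an API from the paper.
    """
    rng = rng or random.Random()
    # Min-heap over priorities; the root is the weakest item currently kept.
    heap = []
    for index, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue  # items with non-positive weight are never sampled
        # Exponential clock with rate `weight`: heavier items tend to ring earlier.
        clock = rng.expovariate(1.0) / weight
        priority = -clock  # earlier clock means higher priority
        entry = (priority, index, item)  # index breaks ties deterministically
        if len(heap) < s:
            heapq.heappush(heap, entry)
        elif priority > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]


if __name__ == "__main__":
    # One very heavy item among many light ones: it can appear at most once.
    data = [("heavy", 1e9)] + [(f"x{i}", 1.0) for i in range(100)]
    print(weighted_swor(data, 5, random.Random(0)))
```

Because each stream position contributes at most one entry to the heap, a heavy item occupies at most one slot of the sample, which is the property the abstract contrasts with sampling with replacement.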

research
03/01/2019

Parallel Weighted Random Sampling

Data structures for efficient sampling from a set of weighted items are ...
research
03/28/2019

Optimal Random Sampling from Distributed Streams Revisited

We give an improved algorithm for drawing a random sample from a large d...
research
03/02/2022

Pattern Recognition and Event Detection on IoT Data-streams

Big data streams are possibly one of the most essential underlying notio...
research
04/11/2021

Simple, Optimal Algorithms for Random Sampling Without Replacement

Consider the fundamental problem of drawing a simple random sample of si...
research
10/24/2019

Communication-Efficient (Weighted) Reservoir Sampling

We consider communication-efficient weighted and unweighted (uniform) ra...
research
05/08/2023

Risk-limiting Financial Audits via Weighted Sampling without Replacement

We introduce the notion of a risk-limiting financial auditing (RLFA): gi...
research
01/06/2022

SQUAD: Combining Sketching and Sampling Is Better than Either for Per-item Quantile Estimation

Stream monitoring is fundamental in many data stream applications, such ...
