Differentially Private One Permutation Hashing and Bin-wise Consistent Weighted Sampling

by Xiaoyun Li, et al.

Minwise hashing (MinHash) is a standard algorithm, widely used in industry, for large-scale search and learning applications with the binary (0/1) Jaccard similarity. One common use of MinHash is processing massive n-gram text representations so that practitioners do not have to materialize the original data (which would be prohibitive). Another popular use is building hash tables to enable sub-linear time approximate near neighbor (ANN) search. MinHash has also been used as a tool for building large-scale machine learning systems. The standard implementation of MinHash requires applying K random permutations. In comparison, one permutation hashing (OPH) is an efficient alternative to MinHash that splits the data vectors into K bins and generates hash values within each bin. OPH is substantially more efficient and also more convenient to use. In this paper, we combine differential privacy (DP) with OPH (as well as MinHash) to propose the DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re and DP-OPH-rand, depending on which densification strategy is adopted to deal with empty bins in OPH. A detailed roadmap of the algorithm design is presented along with the privacy analysis. An analytical comparison of the proposed DP-OPH methods with DP minwise hashing (DP-MH) is provided to justify the advantage of DP-OPH. Experiments on similarity search confirm the merits of DP-OPH and guide the choice of the proper variant in different practical scenarios. Our technique is also extended to bin-wise consistent weighted sampling (BCWS) to develop a new DP algorithm, called DP-BCWS, for non-binary data. Experiments on classification tasks demonstrate that DP-BCWS is able to achieve excellent utility at around ϵ = 5 to 10, where ϵ is the standard privacy parameter in the language of (ϵ, δ)-DP.
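To make the binning idea concrete, the following is a minimal Python sketch of plain OPH as described above: a single permutation, K equal-width bins, and the smallest permuted offset kept per bin. It is an illustration under stated assumptions, not the authors' implementation. It includes no differential privacy mechanism and no densification (empty bins are simply marked None), and the function names, the seeded permutation, and the requirement that the dimension be divisible by the number of bins are all choices made for this sketch.

```python
import random


def one_permutation_hash(nonzeros, dim, num_bins, seed=0):
    """Hypothetical helper: OPH of a binary vector given by its nonzero indices.

    Returns a list of num_bins values. Each value is the smallest within-bin
    offset among the permuted nonzero positions, or None for an empty bin.
    """
    if dim % num_bins != 0:
        # Assumption of this sketch: equal-width bins only.
        raise ValueError("dim must be divisible by num_bins")
    rng = random.Random(seed)
    perm = list(range(dim))
    rng.shuffle(perm)  # perm[i] is the permuted position of original index i
    bin_size = dim // num_bins
    hashes = [None] * num_bins
    for i in nonzeros:
        bin_id, offset = divmod(perm[i], bin_size)
        if hashes[bin_id] is None or offset < hashes[bin_id]:
            hashes[bin_id] = offset
    return hashes


def estimate_jaccard(h1, h2):
    """Estimate Jaccard similarity from two OPH sketches built with the same seed.

    Bins that are empty in both sketches are excluded from the denominator;
    a bin that is empty in only one sketch counts as a mismatch.
    """
    matched = sum(1 for a, b in zip(h1, h2) if a is not None and a == b)
    jointly_empty = sum(1 for a, b in zip(h1, h2) if a is None and b is None)
    used = len(h1) - jointly_empty
    return matched / used if used else 0.0


if __name__ == "__main__":
    x = set(range(0, 60))
    y = set(range(30, 90))  # true Jaccard similarity is 30 / 90
    hx = one_permutation_hash(x, dim=1024, num_bins=64, seed=7)
    hy = one_permutation_hash(y, dim=1024, num_bins=64, seed=7)
    print(estimate_jaccard(hx, hy))
```

Both vectors must be hashed with the same permutation (here, the same seed) for the per-bin comparison to be meaningful. The None entries are exactly the empty bins that the densification strategies behind DP-OPH-fix, DP-OPH-re and DP-OPH-rand are meant to handle.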




