DartMinHash: Fast Sketching for Weighted Sets

05/23/2020
by   Tobias Christiani, et al.

Weighted minwise hashing is a standard dimensionality reduction technique with applications to similarity search and large-scale kernel machines. We introduce a simple algorithm that takes a weighted set x ∈ ℝ_{≥0}^d and computes k independent minhashes in expected time O(k log k + ‖x‖_0 log(‖x‖_1 + 1/‖x‖_1)), improving upon the state-of-the-art BagMinHash algorithm (KDD '18) and representing the fastest weighted minhash algorithm for sparse data. Our experiments show running times that scale better with k and ‖x‖_0 than those of ICWS (ICDM '10) and BagMinHash, with 10x speedups in common use cases. Our approach also gives rise to a technique for computing fully independent locality-sensitive hash values for (L, K)-parameterized approximate near neighbor search under weighted Jaccard similarity in optimal expected time O(LK + ‖x‖_0), improving on prior work even in the case of unweighted sets.
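To make the setting concrete, the following is a minimal Python sketch of weighted Jaccard similarity together with a baseline weighted minhash in the style of ICWS. It is not the DartMinHash algorithm, which the abstract does not specify. The function names, the dict-based sparse representation, and the string-seeded random generator are all assumptions made for illustration. The baseline visits every nonzero entry for each of the k hashes, so it costs on the order of k · ‖x‖_0 operations, which is the kind of scaling the abstract's bound improves on.

```python
import math
import random


def weighted_jaccard(x, y):
    """Weighted Jaccard similarity: sum of mins over sum of maxes.

    x and y are sparse weighted sets given as {index: nonnegative weight}.
    """
    keys = set(x) | set(y)
    num = sum(min(x.get(i, 0.0), y.get(i, 0.0)) for i in keys)
    den = sum(max(x.get(i, 0.0), y.get(i, 0.0)) for i in keys)
    return num / den if den > 0 else 0.0


def icws_sketch(x, k, seed=0):
    """Baseline ICWS-style sketch (illustrative, not DartMinHash).

    Returns k hash values, each a pair (index, t). Two weighted sets agree
    on a given hash with probability equal to their weighted Jaccard
    similarity. An empty input yields a list of None values.
    """
    sketch = []
    for j in range(k):
        best = None
        for i, w in x.items():
            if w <= 0:
                continue
            # The randomness depends only on (seed, hash number, index),
            # so different sets draw the same values for the same index.
            rng = random.Random(f"{seed}:{j}:{i}")
            r = rng.gammavariate(2.0, 1.0)
            c = rng.gammavariate(2.0, 1.0)
            beta = rng.random()
            t = math.floor(math.log(w) / r + beta)
            # a = c / (y * e^r) with y = exp(r * (t - beta))
            a = c / math.exp(r * (t - beta + 1.0))
            if best is None or a < best[0]:
                best = (a, i, t)
        sketch.append((best[1], best[2]) if best is not None else None)
    return sketch


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of positions where the two sketches collide."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)
```

For example, sketching two small sets with the same `k` and `seed` and passing the results to `estimate_similarity` gives an estimate that concentrates around `weighted_jaccard(x, y)` as `k` grows.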
