One Permutation Hashing for Efficient Search and Learning

08/06/2012
by Ping Li, et al.

Recently, the method of b-bit minwise hashing has been applied to large-scale linear learning and sublinear time near-neighbor search. The major drawback of minwise hashing is the expensive preprocessing cost, as the method requires applying many permutations (e.g., k = 200 to 500) to the data. The testing time can also be expensive if a new data point (e.g., a new document or image) has not been processed, which might be a significant issue in user-facing applications. We develop a very simple solution based on one permutation hashing. Conceptually, given a massive binary data matrix, we permute the columns only once and divide the permuted columns evenly into k bins; we then simply store, for each data vector, the smallest nonzero location in each bin. The probability analysis (which is validated by experiments) reveals that our one permutation scheme should perform very similarly to the original (k-permutation) minwise hashing. In fact, the one permutation scheme can be even slightly more accurate, due to the "sample-without-replacement" effect. Our experiments with training linear SVM and logistic regression on the webspam dataset demonstrate that this one permutation hashing scheme can achieve the same (or even slightly better) accuracies compared to the original k-permutation scheme. To test the robustness of our method, we also experiment with the small news20 dataset, which is very sparse and has on average merely 500 nonzeros in each data vector. Interestingly, our one permutation scheme noticeably outperforms the k-permutation scheme on the news20 dataset when k is not too small. In summary, our method can achieve at least the same accuracy as the original k-permutation scheme, at merely 1/k of the original preprocessing cost.
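
The scheme described in the abstract (permute once, split the permuted columns into k bins, keep the smallest nonzero location per bin) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the use of `None` for empty bins, the requirement that k divides the dimension, and the resemblance estimator (matching bins divided by bins that are not empty in both vectors) are all assumptions made for illustration; the abstract does not specify them.

```python
import random


def one_permutation_hash(nonzeros, perm, k):
    """Sketch a binary vector with a single permutation.

    nonzeros: indices of the nonzero entries of the vector.
    perm:     a permutation of range(D), applied to column indices.
    k:        number of bins; this sketch assumes k divides D.
    Returns a list of k values: the smallest permuted location inside
    each bin (as an offset within the bin), or None for an empty bin.
    """
    dim = len(perm)
    if dim % k != 0:
        raise ValueError("this sketch assumes k divides the dimension")
    bin_size = dim // k
    sketch = [None] * k
    for i in nonzeros:
        b, offset = divmod(perm[i], bin_size)
        if sketch[b] is None or offset < sketch[b]:
            sketch[b] = offset
    return sketch


def estimate_resemblance(s1, s2):
    """Assumed estimator: matches / (k - bins empty in both sketches)."""
    k = len(s1)
    both_empty = sum(1 for a, b in zip(s1, s2) if a is None and b is None)
    matches = sum(1 for a, b in zip(s1, s2) if a is not None and a == b)
    usable = k - both_empty
    return matches / usable if usable > 0 else 0.0


if __name__ == "__main__":
    dim, k = 1024, 64
    rng = random.Random(0)
    perm = rng.sample(range(dim), dim)  # one random permutation
    x = set(rng.sample(range(dim), 300))
    y = set(list(x)[:200]) | set(rng.sample(range(dim), 100))
    true_r = len(x & y) / len(x | y)
    est_r = estimate_resemblance(
        one_permutation_hash(x, perm, k),
        one_permutation_hash(y, perm, k),
    )
    print(f"true resemblance {true_r:.3f}, estimate {est_r:.3f}")
```

Each vector is scanned once under a single permutation, whereas k-permutation minwise hashing scans it k times; this is the source of the 1/k preprocessing cost stated in the abstract.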


research
08/15/2011

Training Logistic Regression and SVM on 200GB Data Using b-Bit Minwise Hashing and Comparisons with Vowpal Wabbit (VW)

We generated a dataset of 200 GB with 10^9 features, to test our recent ...
research
05/29/2018

Hierarchical One Permutation Hashing: Efficient Multimedia Near Duplicate Detection

With advances in multimedia technologies and the proliferation of smart ...
research
11/18/2021

C-OPH: Improving the Accuracy of One Permutation Hashing (OPH) with Circulant Permutations

Minwise hashing (MinHash) is a classical method for efficiently estimati...
research
09/10/2021

C-MinHash: Practically Reducing Two Permutations to Just One

Traditional minwise hashing (MinHash) requires applying K independent pe...
research
09/07/2021

C-MinHash: Rigorously Reducing K Permutations to Two

Minwise hashing (MinHash) is an important and practical algorithm for ge...
research
05/23/2011

b-Bit Minwise Hashing for Large-Scale Linear SVM

In this paper, we propose to (seamlessly) integrate b-bit minwise hashin...
research
04/21/2016

LOH and behold: Web-scale visual search, recommendation and clustering using Locally Optimized Hashing

We propose a novel hashing-based matching scheme, called Locally Optimiz...
