Scalable and robust set similarity join

07/21/2017
by Tobias Christiani, et al.
Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting, where the goal is to compute all pairs of sets that exceed a given level of similarity (measured, e.g., as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be important, and indeed where the exact set similarity join is itself only an approximation of the desired result set. We present a new randomized algorithm for set similarity join that can achieve any desired recall up to 100%, and show that it significantly outperforms state-of-the-art implementations of exact methods and improves on existing approximate methods. Our experiments on benchmark data sets show that the method is several times faster than comparable approximate methods, and that at 90% recall it is orders of magnitude faster than exact methods. Our algorithm makes use of recent theoretical advances in high-dimensional sketching and indexing that we believe to be of wider relevance to the database community.
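
To make the problem statement concrete, the following is a minimal Python sketch of the exact setting and of how recall is measured against it. It is a brute-force baseline for illustration only, not the randomized algorithm the abstract describes; the function names (`jaccard`, `exact_join`, `recall`), the toy data, and the threshold of 0.5 are all assumptions introduced here.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(a | b)


def exact_join(sets, threshold):
    """Brute-force self-join: every index pair (i, j), i < j,
    whose Jaccard similarity is at least `threshold`."""
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(sets), 2)
        if jaccard(a, b) >= threshold
    }


def recall(reported, exact):
    """Fraction of the exact join result that `reported` contains."""
    if not exact:
        return 1.0
    return len(reported & exact) / len(exact)


# Hypothetical toy collection of sets.
sets = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}, {7, 8, 9}]
exact = exact_join(sets, 0.5)

# An approximate method might miss some qualifying pairs; here one is dropped by hand.
approximate = {(0, 2), (1, 2)}
print(sorted(exact), recall(approximate, exact))
```

The brute-force join compares all pairs, so its cost grows quadratically with the number of sets. An approximate method trades some recall for avoiding most of those comparisons.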
