WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses

12/29/2022
by Tianji Cong, et al.

Data discovery is a major challenge in enterprise data analysis: users often struggle to find data relevant to their analysis goals or even to navigate through data across data sources, each of which may easily contain thousands of tables. One common user need is to discover tables joinable with a given table. This need is particularly critical because join is a ubiquitous operation in data analysis, and join paths are mostly obscure to users, especially across databases. Furthermore, users are typically interested in finding “semantically” joinable tables, that is, tables with columns that can be transformed to become joinable even if they are not joinable as currently represented in the data store. We present WarpGate, a system prototype for data discovery over cloud data warehouses. WarpGate implements an embedding-based solution to semantic join discovery, which encodes columns into a high-dimensional vector space such that joinable columns map to points that are near each other. Through experiments on several table corpora, we show that WarpGate (i) captures semantic relationships between tables, especially those across databases, and (ii) is sample-efficient and thus scalable to very large tables with millions of rows. We also showcase an application of WarpGate within an enterprise product for cloud data analytics.
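
The embedding-based idea in the abstract can be illustrated with a small sketch. This is not WarpGate's implementation: the abstract does not say which embedding model the system uses, so the example below substitutes hashed character trigrams as a stand-in. The names `embed_column`, `top_joinable`, `DIM`, and the `sample_size` parameter are all hypothetical. The sketch only shows the general pattern: sample values from each column, map the sample to a vector, and rank candidate columns by cosine similarity to the query column.

```python
import hashlib
import math
import random

DIM = 64  # hypothetical embedding size, chosen only for illustration


def _bucket(token: str) -> int:
    """Map a token to one of DIM buckets with a stable hash."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % DIM


def embed_column(values, sample_size=100, seed=0):
    """Embed a column as a unit vector of hashed character-trigram counts.

    Stand-in for a learned embedding model. Only a random sample of the
    values is used, mirroring the sampling idea mentioned in the abstract.
    """
    values = list(values)
    if len(values) > sample_size:
        values = random.Random(seed).sample(values, sample_size)
    vec = [0.0] * DIM
    for value in values:
        # Light normalization so "BOSTON " and "Boston" look alike.
        text = str(value).strip().lower()
        padded = f"  {text} "
        for i in range(len(padded) - 2):
            vec[_bucket(padded[i:i + 3])] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(u, v):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(a * b for a, b in zip(u, v))


def top_joinable(query_values, candidates, k=3):
    """Rank candidate columns by similarity to the query column."""
    q = embed_column(query_values)
    scored = [(name, cosine(q, embed_column(col)))
              for name, col in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    query = ["New York", "Boston", "Chicago"]
    candidates = {
        "city_upper": ["NEW YORK", "BOSTON ", "CHICAGO"],
        "prices": [12.5, 99.0, 3.75],
    }
    for name, score in top_joinable(query, candidates):
        print(f"{name}: {score:.3f}")
```

In this toy setup, a column that differs from the query only in letter case and whitespace embeds to the same vector, which is one simple case of a column that becomes joinable after a transformation. A production system would presumably replace the brute-force ranking with an approximate nearest-neighbor index, but that choice is also an assumption here.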
