Model Joins: Enabling Analytics Over Joins of Absent Big Tables

This work is motivated by two key facts. First, it is highly desirable to be able to learn and perform knowledge discovery and analytics (LKD) tasks without the need to access raw-data tables. This may be due to organizations finding it increasingly frustrating and costly to manage and maintain ever-growing tables, or for privacy reasons. Hence, compact models can be developed from the raw data and used instead of the tables. Second, oftentimes, LKD tasks are to be performed on a (potentially very large) table which is itself the result of joining separate (potentially very large) relational tables. But how can one do this, when the individual to-be-joined tables are absent? Here, we pose the following fundamental questions: Q1: How can one "join models" of (absent/deleted) tables or "join models with other tables" in a way that enables LKD as if it were performed on the join of the actual raw tables? Q2: What are appropriate models to use per table? Q3: As the model join would be an approximation of the actual data join, how can one evaluate the quality of the model join result? This work puts forth a framework, Model Join, addressing these challenges. The framework integrates and joins the per-table models of the absent tables and generates a uniform and independent sample that is a high-quality approximation of a uniform and independent sample of the actual raw-data join. The approximation stems from the models, but not from the Model Join framework. The sample obtained by the Model Join can be used to perform LKD downstream tasks, such as approximate query processing, classification, clustering, regression, association rule mining, visualization, and so on. To our knowledge, this is the first work with this agenda and solutions. Detailed experiments with TPC-DS data and synthetic data showcase Model Join's usefulness.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
12/29/2022

WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses

Data discovery is a major challenge in enterprise data analysis: users o...
research
12/07/2019

Joins on Samples: A Theoretical Guide for Practitioners

Despite decades of research on approximate query processing (AQP), our u...
research
04/07/2021

Correlation Sketches for Approximate Join-Correlation Queries

The increasing availability of structured datasets, from Web tables and ...
research
10/07/2022

Calibration: A Simple Trick for Wide-table Delta Analytics

Data analytics over normalized databases typically requires computing an...
research
07/01/2023

Aggregation Consistency Errors in Semantic Layers and How to Avoid Them

Analysts often struggle with analyzing data from multiple tables in a da...
research
09/30/2021

Secure Machine Learning over Relational Data

A closer integration of machine learning and relational databases has ga...
research
03/07/2021

Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples

Fuzzy similarity join is an important database operator widely used in p...

Please sign up or login with your details

Forgot password? Click here to reset