Saibot: A Differentially Private Data Search Platform

07/01/2023
by Zezhou Huang, et al.

Recent data search platforms use ML task-based utility measures, rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset, and these platforms search for augmentations (join- or union-compatible datasets) that, when used to augment the requester's dataset, most improve the performance of a model (e.g., linear regression). Although this approach is effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial: a single search can involve training and evaluating models over datasets hundreds or thousands of times, quickly depleting privacy budgets. We present Saibot, a differentially private data search platform that employs the Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once and can be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests while minimizing the effect of DP noise on search results. We optimize the sensitivity of FPM for common augmentation operations and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization that redistributes DP noise to minimize its impact on the model. Our evaluation on a real-world corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50 to 90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.
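The "privatize once, reuse freely" idea can be illustrated for linear regression with a minimal sketch. This is not the FPM itself: it only shows that the sums of pairwise column products are sufficient to fit a linear model, so one noisy copy of them can serve many candidate feature sets without touching the raw rows again. All function names, the noise scale `sigma`, and the small `ridge` term are assumptions made for illustration. Calibrating `sigma` to a privacy budget, bounding sensitivity, and handling joins or unions are out of scope here.

```python
import random

def gram_statistics(rows):
    """Sum of pairwise column products over all rows (rows include a constant 1 and the target)."""
    d = len(rows[0])
    g = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            for j in range(d):
                g[i][j] += r[i] * r[j]
    return g

def privatize(g, sigma, rng):
    """Add symmetric Gaussian noise once. `sigma` is a hypothetical, uncalibrated noise scale."""
    d = len(g)
    noisy = [row[:] for row in g]
    for i in range(d):
        for j in range(i, d):
            n = rng.gauss(0.0, sigma)
            noisy[i][j] += n
            if i != j:
                noisy[j][i] += n
    return noisy

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def fit_from_statistics(g, features, target, ridge=1e-6):
    """Least-squares coefficients for the chosen feature columns, using only the statistics g."""
    a = [[g[i][j] + (ridge if i == j else 0.0) for j in features] for i in features]
    b = [g[i][target] for i in features]
    return solve(a, b)

# Synthetic data: columns are [1, x1, x2, y] with y = 2 + 3 * x1.
rng = random.Random(0)
rows = []
for _ in range(200):
    x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    rows.append([1.0, x1, x2, 2.0 + 3.0 * x1])

noisy = privatize(gram_statistics(rows), sigma=1.0, rng=random.Random(1))

# The same noisy statistics are reused for two candidate feature sets.
coef_small = fit_from_statistics(noisy, features=[0, 1], target=3)
coef_large = fit_from_statistics(noisy, features=[0, 1, 2], target=3)
print(coef_small, coef_large)
```

Because every fit reads only `noisy`, evaluating further feature sets adds no further access to the underlying rows, which is the property the abstract relies on to avoid depleting the privacy budget during search.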


research
08/10/2023

The Fast and the Private: Task-based Dataset Search

Modern dataset search platforms employ ML task-based utility metrics ins...
research
07/25/2023

Accuracy Amplification in Differentially Private Logistic Regression: A Pre-Training Approach

Machine learning (ML) models can memorize training datasets. As a result...
research
03/02/2021

DP-InstaHide: Provably Defusing Poisoning and Backdoor Attacks with Differentially Private Data Augmentations

Data poisoning and backdoor attacks manipulate training data to induce s...
research
08/05/2022

DP^2-VAE: Differentially Private Pre-trained Variational Autoencoders

Modern machine learning systems achieve great success when trained on la...
research
01/05/2023

DP-SIPS: A simpler, more scalable mechanism for differentially private partition selection

Partition selection, or set union, is an important primitive in differen...
research
02/13/2020

Differentially Private Call Auctions and Market Impact

We propose and analyze differentially private (DP) mechanisms for call a...
research
12/16/2021

Construction of Differentially Private Summaries over Fully Homomorphic Encryption

Cloud computing has garnered attention as a platform of query processing...
