Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms

01/21/2023
by   Ximing Li, et al.

Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST). To deal with high-dimensional data, the distribution of synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by introducing random noise to each low-dimensional marginal distribution. Despite their promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the L^2 distance. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound for TV accuracy that holds for every ϵ-DP synthetic data generator.
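
To make the noise-injection step concrete, the following is a minimal sketch of perturbing one low-dimensional marginal with Laplace noise and measuring the resulting TV error. It is not the paper's algorithm. The function names (`noisy_marginal`, `tv_distance`), the toy data, the choice of the Laplace mechanism, and the clip-and-renormalise post-processing are all assumptions made for illustration. It also assumes neighbouring datasets differ by adding or removing one record (so the count table has L1 sensitivity 1) and that the whole budget `epsilon` is spent on this single marginal.

```python
import numpy as np


def noisy_marginal(data, cols, domain_sizes, epsilon, rng):
    """Return a Laplace-perturbed, normalised marginal over the columns `cols`.

    `data` is an integer array of shape (n_records, n_attributes) whose
    entries in column c lie in {0, ..., domain_sizes[c] - 1}.
    """
    shape = tuple(domain_sizes[c] for c in cols)
    counts = np.zeros(shape)
    # Tally how many records fall into each cell of the contingency table.
    np.add.at(counts, tuple(data[:, c] for c in cols), 1)
    # Assumption: add/remove-one neighbours, so one record changes one cell
    # by 1 and the L1 sensitivity of the count table is 1.
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=shape)
    # Post-processing (does not affect the DP guarantee): clip and renormalise.
    noisy = np.clip(noisy, 0.0, None)
    total = noisy.sum()
    if total == 0.0:
        return np.full(shape, 1.0 / noisy.size)
    return noisy / total


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sizes = [3, 3, 3, 3]
    data = rng.integers(0, 3, size=(1000, 4))  # hypothetical toy dataset
    exact = np.zeros((3, 3))
    np.add.at(exact, (data[:, 0], data[:, 2]), 1)
    exact /= exact.sum()
    private = noisy_marginal(data, [0, 2], sizes, epsilon=1.0, rng=rng)
    print("TV error of the noisy marginal:", tv_distance(exact, private))
```

In a full marginal-based pipeline the budget would be split across many marginals and a graphical model would then be fitted to the noisy tables; this sketch covers only the single-marginal step.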


research
08/11/2021

Winning the NIST Contest: A scalable and general approach to differentially private synthetic data

We propose a general approach for differentially private synthetic data ...
research
05/28/2022

Noise-Aware Statistical Inference with Differentially Private Synthetic Data

While generation of synthetic data under differential privacy (DP) has r...
research
05/26/2023

Differentially private low-dimensional representation of high-dimensional data

Differentially private synthetic data provide a powerful mechanism to en...
research
07/19/2023

DP-TBART: A Transformer-based Autoregressive Model for Differentially Private Tabular Data Generation

The generation of synthetic tabular data that preserves differential pri...
research
05/10/2022

Mechanisms for Global Differential Privacy under Bayesian Data Synthesis

This paper introduces a new method that embeds any Bayesian model used t...
research
07/16/2023

MargCTGAN: A "Marginally" Better CTGAN for the Low Sample Regime

The potential of realistic and useful synthetic data is significant. How...
