Synthetic Data for Social Good

by Bill Howe, et al.

Data for good implies unfettered access to data. But data owners must be conservative about how, when, and why they share data, or they risk violating the trust of the people they aim to help, losing their funding, or breaking the law. Data sharing agreements can help prevent privacy violations, but they require a level of specificity that is premature during preliminary discussions, and they can take over a year to establish. We consider the generation and use of synthetic data to facilitate ad hoc collaborations involving sensitive data. A good synthetic dataset has two properties: it is representative of the original data, and it provides strong guarantees about privacy. In this paper, we discuss important use cases for synthetic data that challenge the state of the art in privacy-preserving data generation, and describe DataSynthesizer, a dataset generation tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset, with strong privacy guarantees, as output. The data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will work similarly on the real dataset. The distinguishing feature of DataSynthesizer is its usability: in most cases, the data owner need not specify any parameters to start generating and sharing data safely and effectively. The code implementing DataSynthesizer is publicly available on GitHub. The work on DataSynthesizer is part of the Data, Responsibly project, where the goal is to operationalize responsibility in data sharing, integration, analysis, and use.


What Is Synthetic Data? The Good, The Bad, and The Ugly

Sharing data can often enable compelling applications and analytics. How...

A Unified Framework for Quantifying Privacy Risk in Synthetic Data

Synthetic data is often presented as a method for sharing sensitive info...

Privacy-Preserving Synthetic Data Generation for Recommendation Systems

Recommendation systems make predictions chiefly based on users' historic...

Synthcity: facilitating innovative use cases of synthetic data in different data modalities

Synthcity is an open-source software package for innovative use cases of...

Automatic Generation of Machine Learning Synthetic Data Using ROS

Data labeling is a time intensive process. As such, many data scientists...

Cluster Aware Mobility Encounter Dataset Enlargement

The recent emerging fields in data processing and manipulation has facil...

Data Masking with Privacy Guarantees

We study the problem of data release with privacy, where data is made av...
