What is Synthetic Data?
The concept of synthetic data seems to be having “a moment” in the privacy world as a promising approach to sharing data while protecting privacy. Strictly speaking, any time you make up data, you have produced a synthetic dataset, but more specifically
a synthetic dataset is a stand-in for some original dataset that has the same format, and accurately reflects the statistical properties of the original dataset, but contains only “fake” records.
Intuitively, a synthetic dataset can be used as if it were the real data—we can stare at it, compute summary statistics from it, train models from it, and do everything we normally do with data—but because the records do not correspond to “real” people, we don’t have to worry about protecting privacy.
To be sure, synthetic data is a very appealing concept, but
simply making data “synthetic” does not guarantee privacy in any meaningful sense of the word,
and we need to be careful about what it actually means to generate private synthetic data.
In this post I will briefly describe differentially private synthetic data, why we need it, and what we know about producing it. The post will serve as a warmup to some more technical posts that will follow later, describing general approaches and state-of-the-art algorithms for producing synthetic data.
What is synthetic data?
Given a dataset \(X\), a synthetic dataset \(Y\) is a new dataset that has the same structure as \(X\), but whose elements are “fake.” For example, if \(X\) is a highly stylized version of the data collected in the US decennial census, then each element of \(X\) would be a tuple of the form (age, census_block, sex, race, ethnicity) corresponding to the data of a real person who filled out a census card. A synthetic dataset should also contain such tuples, but they no longer need to correspond to real people.
An example of a synthetic dataset for the census would be a dataset consisting of 300,000,000 individuals, all living at one location in rural Mississippi, and all of whom are white, non-Hispanic females aged 81. And therein lies the challenge. Of course we want a synthetic dataset not to merely have the same structure as the original data, but also to preserve the statistical properties of the original data. A synthetic dataset for the census should put roughly the right number of individuals of the right types in the right places to reflect the actual population of the US.
Thus, to generate synthetic data, we have to know something about the original data, which creates opportunities to leak sensitive information.
Synthetic data from generative modeling
Ensuring that a synthetic dataset really only reflects the statistical properties of the real data, and doesn’t encode sensitive information about the individuals in the real dataset, is quite difficult. I could go on forever about strawman proposals that would generate records that appear to be “fake” but really just encode the original dataset in obvious ways. Just so you believe I could do it: suppose that for every real record, we generate a fake record by adding 10 years to that individual’s age. But what people typically mean when they talk about generating synthetic data, and what the synthetic data products being sold by companies typically do, is train a generative model for the dataset \(X\) and obtain the synthetic dataset \(Y\) by sampling from that model. The hope is that samples from the generative model look like the original dataset in aggregate, but the actual people in the sample have nothing to do with the real people in the original dataset. For example, a generative model trained on the US population will generate more people living in California than in Montana, but ideally wouldn’t simply spit out the data of the actual residents of California.
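To make the pipeline concrete, here is a minimal, deliberately non-private sketch in Python. The toy schema and the independent-attribute model are my own assumptions for illustration; real synthetic-data products fit much richer models, but the shape is the same: fit a model to \(X\), then sample \(Y\) from it.

```python
# A minimal, NON-PRIVATE sketch of the "fit a generative model, then sample"
# pipeline. The toy schema and the independent-attribute model are assumptions
# for illustration only; real products use far richer models.
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" dataset: each record is (age_bucket, state, sex).
real = [
    ("30-39", "CA", "F"), ("30-39", "CA", "M"), ("70-79", "MT", "F"),
    ("20-29", "CA", "F"), ("40-49", "NY", "M"), ("20-29", "NY", "F"),
]

def fit_independent_model(records):
    """Estimate a separate categorical distribution for each column."""
    columns = list(zip(*records))
    model = []
    for col in columns:
        values, counts = np.unique(col, return_counts=True)
        model.append((values, counts / counts.sum()))
    return model

def sample(model, m):
    """Draw m synthetic records, sampling each attribute independently."""
    cols = [rng.choice(values, size=m, p=probs) for values, probs in model]
    return list(zip(*cols))

synthetic = sample(fit_independent_model(real), m=10)
print(synthetic)  # per-column frequencies roughly match the real data
```

Note that nothing here is private: the fitted frequencies are exact functions of the real records, and this particular model also discards all correlations between columns.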
Generative models don’t magically protect privacy
Simply training a generative model on \(X\) doesn’t actually mean we’ve hidden anything about \(X\) or the people in it. For an extreme example, suppose the records in \(X\) are \(x_1,\dots,x_n\) and our generative model is arbitrarily expressive and determines that the best model for the data is just the uniform distribution on the set \(x_1,\dots,x_n\). Then if our synthetic dataset consists of \(n\) i.i.d. samples from the generative model, our synthetic dataset will simply contain one or more copies of the real data for a large fraction (about \(1 - 1/e \approx 63\%\)) of the people in the original dataset.
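The fraction \(1 - 1/e\) comes from a short calculation, spelled out here for completeness: a fixed record \(x_i\) is missed by all \(n\) independent draws with probability \((1 - 1/n)^n\), so
\[
\Pr\big[x_i \text{ appears in } Y\big] \;=\; 1 - \Big(1 - \frac{1}{n}\Big)^{n} \;\ge\; 1 - \frac{1}{e} \;\approx\; 0.63,
\]
and in expectation about 63% of the real records are copied verbatim into the “synthetic” dataset.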
Reconstruction attacks say hello
My example was admittedly a strawman, and when we train a generative model we typically expect it to produce new examples that aren’t in the original training dataset. Maybe by training generative models in the right way, we can ensure that the generative model only captures statistics about the dataset, without revealing information about the individuals in the dataset? Let’s not forget that
a synthetic dataset is just one of many possible ways to release statistical information about a dataset, and we know from the theory of reconstruction attacks that any mechanism that accurately preserves too many statistics inevitably destroys privacy.
Specifically, there is a rich and fairly comprehensive theory of reconstruction attacks, which we detailed, in both theory and practice, in earlier posts. This theory says that any synthetic dataset that preserves answers to every statistic of the form “How many elements of the dataset satisfy property P?” with any meaningful notion of accuracy is spectacularly non-private, in the sense that an attacker can reconstruct nearly all of the private information in the original dataset. Viewed through this lens, synthetic data becomes something of a red herring. The important question is which subset of statistics we want to preserve and how to preserve them without compromising privacy.
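For readers who haven’t seen those posts, here is a toy version of a Dinur–Nissim-style linear reconstruction attack (my own minimal setup, not the attacks analyzed in those papers): the private data is a vector of secret bits, the released statistics are noisy subset-sum counts, and ordinary least squares followed by rounding recovers almost every bit once enough counts are released.

```python
# Toy Dinur--Nissim-style reconstruction attack (illustrative sketch only).
# Secret data: n bits. Released statistics: noisy counts "how many people in
# subset S have bit 1", for many random subsets S. Attack: least squares + rounding.
import numpy as np

rng = np.random.default_rng(1)

n = 200                      # number of individuals (one secret bit each)
num_queries = 4 * n          # number of released subset-count statistics
noise_scale = 3              # each released count is off by at most a few

secret = rng.integers(0, 2, size=n)                  # the "private" bits
subsets = rng.integers(0, 2, size=(num_queries, n))  # random subset queries
noisy_counts = subsets @ secret + rng.integers(-noise_scale, noise_scale + 1,
                                               size=num_queries)

# The attacker sees only `subsets` and `noisy_counts`.
estimate, *_ = np.linalg.lstsq(subsets, noisy_counts, rcond=None)
reconstructed = (estimate > 0.5).astype(int)

print("fraction of bits recovered:", (reconstructed == secret).mean())
```

With this many accurately answered counting queries, the recovered fraction is close to 1, which is exactly the sense in which “too many accurate statistics” destroys privacy.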
Differentially private synthetic data
Making a dataset “synthetic” may not be a magic bullet, but it’s still a useful goal. Fortunately, we already have a good working definition of what it means for an algorithm to protect the privacy of the individuals in the dataset—differential privacy. So perhaps we can design a differentially private algorithm that takes the original dataset \(X\) and outputs a synthetic dataset \(Y\) that preserves many of the properties of \(X\) accurately (but still within the limits imposed by reconstruction attacks)? In fact, the answer turns out to be a resounding yes!
There is a differentially private algorithm that takes a dataset \(X\) and outputs a synthetic dataset \(Y\) that preserves an exponential number of statistical properties of \(X\) [BLR08].
This statement is intentionally quite informal, but intuitively this result says that we can hope to do amazing things, like generate a synthetic dataset that satisfies the strong guarantee of differential privacy while still approximately preserving impressively complex statistics of the dataset, such as the marginal distribution of every set of three attributes, or the prediction error of every linear classifier.
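To make the flavor of such algorithms concrete, here is a minimal sketch in the spirit of [BLR08], though not the actual algorithm from the paper: run the exponential mechanism over all small candidate datasets on a tiny domain, scoring each candidate by its worst-case error on a fixed family of marginal queries. The domain, the query family, and the parameter choices are all assumptions made for illustration.

```python
# A minimal, non-scalable sketch of selecting a synthetic dataset with the
# exponential mechanism (in the spirit of, but not identical to, [BLR08]).
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Tiny domain: records are 2 binary attributes -> 4 possible records.
domain = list(itertools.product([0, 1], repeat=2))

def queries(data):
    """One-way marginals of each attribute plus the two-way marginal, as fractions."""
    data = np.asarray(data, dtype=float)
    one_way = data.mean(axis=0)
    two_way = (data[:, 0] * data[:, 1]).mean()
    return np.append(one_way, two_way)

def worst_error(real, synth):
    return np.max(np.abs(queries(real) - queries(synth)))

def blr_style_synthetic(real, m=6, epsilon=1.0):
    """Exponential mechanism over all size-m multisets of the domain."""
    n = len(real)
    candidates = list(itertools.combinations_with_replacement(domain, m))
    # Score = negative worst-case query error; changing one real record moves
    # each query fraction by at most 1/n, so the score has sensitivity 1/n.
    scores = np.array([-worst_error(real, c) for c in candidates])
    logits = epsilon * scores / (2 * (1.0 / n))
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return list(candidates[rng.choice(len(candidates), p=probs)])

real = [(1, 1)] * 5 + [(1, 0)] * 3 + [(0, 0)] * 2   # toy "real" dataset, n = 10
synth = blr_style_synthetic(real)
print(synth, worst_error(real, synth))
```

Even in this toy setting the mechanism enumerates every size-\(m\) multiset of the domain, and the domain itself has size \(2^d\) when records have \(d\) binary attributes; that is exactly the running-time bottleneck discussed next.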
So what’s the problem? Well, primarily it’s computational complexity. This algorithm, and the many beautiful algorithms it inspired, all have exponential worst-case running time. Specifically, if each element of the dataset contains \(d\) bits, then the worst-case running time is at least as large as \(2^d\). Unfortunately, this turns out to be an inherent bottleneck for any algorithm that generates synthetic data.
Any private algorithm that takes a dataset \(X\) and outputs a synthetic dataset \(Y\) that preserves even just the correlations between each pair of features must have worst-case running time that grows exponentially in the number of features, under widely believed complexity assumptions. [UV11]
While there are inherent limits on the accuracy of all differentially private algorithms (see [DSSU17]), and there are significant barriers to making differentially private algorithms practical,
computational complexity is the main barrier that is specific to differentially private algorithms for generating synthetic data.¹
Where do we go from here?
Despite the computational bottlenecks, there has still been a lot of amazing progress on differentially private synthetic data. And, despite my criticism of generative modeling in isolation as a means of generating synthetic data, most of these approaches are indeed based on differentially private generative models such as sparse graphical models or GANs, and leverage the fact that these types of models can typically be fit efficiently to realistic data, despite their worst-case hardness. In the next few posts, we will try to describe the landscape of the most promising approaches that we currently have available.
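As a taste of what the efficient end of the spectrum looks like, here is a sketch of roughly the simplest possible differentially private synthesizer, the private counterpart of the non-private sketch earlier in the post: add Laplace noise to each column’s one-way marginal counts, normalize, and sample synthetic records from the noisy marginals. The budget split and the independence assumption are illustrative choices; the graphical-model and GAN approaches mentioned above are far more sophisticated, but they share this basic shape of spending the privacy budget on noisy model parameters and then sampling freely.

```python
# A minimal differentially private synthesizer (sketch, not a production system):
# privatize each column's one-way marginal counts with the Laplace mechanism,
# normalize, then sample synthetic records from the noisy marginals. Sampling
# from the noisy counts is post-processing, so it costs no extra privacy.
import numpy as np

rng = np.random.default_rng(2)

def dp_independent_synthesizer(records, epsilon, m):
    columns = list(zip(*records))
    budget = epsilon / len(columns)   # split epsilon across columns (basic composition)
    model = []
    for col in columns:
        values, counts = np.unique(col, return_counts=True)
        # Each column histogram has sensitivity 1 (add/remove one record),
        # so Laplace noise with scale 1/budget suffices for that column.
        noisy = counts + rng.laplace(scale=1.0 / budget, size=len(counts))
        probs = np.clip(noisy, 0, None)
        total = probs.sum()
        probs = probs / total if total > 0 else np.full(len(values), 1.0 / len(values))
        model.append((values, probs))
    return [tuple(rng.choice(v, p=p) for v, p in model) for _ in range(m)]

real = [("30-39", "CA", "F"), ("70-79", "MT", "F"), ("20-29", "CA", "M"),
        ("20-29", "NY", "F"), ("40-49", "CA", "M"), ("30-39", "NY", "F")]
print(dp_independent_synthesizer(real, epsilon=1.0, m=10))
```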
¹ See [AACGKLSSTZ21] for an interesting example of a statistical limitation that is specific to differentially private algorithms that generate synthetic data.