Differential Privacy
Website for the differential privacy research community
https://differentialprivacy.org
Differentially private deep learning can be effective with self-supervised models
<p>Differential Privacy (DP) is a formal definition of privacy which guarantees that the outcome of a statistical procedure does not vary much regardless of whether an individual input is included in or removed from the training dataset.
This guarantee is desirable when we are tasked to train machine learning models on private datasets that should not memorize individual inputs.
Past works have shown that differentially private models can be resilient to strong membership inference [<a href="https://proceedings.mlr.press/v37/kairouz15.html">1</a>, <a href="https://ieeexplore.ieee.org/abstract/document/9519424">34</a>, <a href="https://proceedings.neurips.cc/paper/2020/hash/fc4ddc15f9f4b4b06ef7844d6bb53abf-Abstract.html">35</a>] and data reconstruction attacks [<a href="https://www.usenix.org/conference/usenixsecurity19/presentation/carlini">2</a>, <a href="https://arxiv.org/abs/2201.12383">3</a>] when the privacy parameter is set to be sufficiently small.
See a <a href="https://differentialprivacy.org/how-to-deploy-ml-with-dp/">prior post</a> for more background on differentially private machine learning.</p>
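<p>For reference, the informal statement above is usually formalized as follows. This is the standard \((\epsilon, \delta)\) definition rather than anything specific to this post, and the symbols \(\mathcal{M}\), \(D\), \(D'\), and \(S\) are notation introduced here for illustration. A randomized algorithm \(\mathcal{M}\) is \((\epsilon, \delta)\)-differentially private if, for every pair of datasets \(D, D'\) that differ in one individual's record and every set of outcomes \(S\),</p>
<p>\[ \mathbb{P}[\mathcal{M}(D) \in S] \le e^{\epsilon} \, \mathbb{P}[\mathcal{M}(D') \in S] + \delta. \]</p>
<p>Smaller values of \(\epsilon\) and \(\delta\) force the two output distributions closer together, which is the sense in which the privacy parameter needs to be "sufficiently small" for the attack-resilience results cited above.</p>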
<p>Yet, in practice, most attempts at training differentially private deep learning models on moderately-sized datasets have resulted in large performance drops compared to when training without privacy-protection baked in.
These performance drops are oftentimes large enough to discourage the adoption of differential privacy protection into machine learning pipelines altogether.</p>
<p>To provide a reference of the potential performance hit, the authors of [<a href="https://arxiv.org/abs/2102.12677">5</a>] trained a ResNet-20 from scratch on CIFAR-10 with a privacy budget of \(\epsilon=8\) and obtained a test accuracy barely over 62% (see their Table 1).
Contrast this with the 8.75% error rate (91.25% accuracy) reported for training the same architecture without enforcing differential privacy [<a href="https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html">6</a>].
While some works report private learning results better than the above, absent additional data, pre-training, or external knowledge, most improvements have been incremental, and the test accuracy for CIFAR-10 models trained under modest privacy leakage (\(\epsilon=3\)) has roughly settled to ~70% in the literature [<a href="https://arxiv.org/abs/2011.11660">4</a>].</p>
<p>One reason behind the performance drop lies in sample efficiency — differentially private learning generally requires much more data than non-private learning to reach an acceptable level of performance.
This also means that learning the high-level features (e.g., syntactic structure in text, edge detectors for images) necessary to perform specific tasks with private data can be much more sample-costly.</p>
<p>This blog post surveys results that leverage public self-supervised pre-training to obtain high-performing models through differentially private fine-tuning.
The pre-train-fine-tune paradigm is straightforward to execute and results in high-performing models under modest privacy budgets for many standard computer vision and natural language processing tasks.
Moreover, existing results have shown that private fine-tuning consistently benefits from improvements in public pre-training.</p>
<h2 id="self-supervised-pre-training">Self-Supervised Pre-Training</h2>
<p>Self-supervised learning is a paradigm which leverages unlabeled data to learn representations that can be useful for a range of downstream tasks.
Since self-supervised learning doesn’t target specific tasks itself,
the (pre-)training procedure doesn’t require labeled data — in many cases, mildly curated unlabeled data is sufficient for self-supervised pre-training to produce models for subsequent fine-tuning.
So far, there have been two broadly successful instantiations of this learning paradigm in computer vision [<a href="http://proceedings.mlr.press/v119/chen20j.html">9</a>] and natural language processing [<a href="https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf">7</a>, <a href="https://arxiv.org/abs/1810.04805">8</a>].
We recap the two approaches below.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<p><strong>Contrastive pre-training for vision:</strong>
One class of self-supervised methods in computer vision (SimCLR, [<a href="http://proceedings.mlr.press/v119/chen20j.html">9</a>]) performs pre-training through contrastive learning.
Algorithms of this type produce embeddings for images with the goal of creating different embeddings for semantically different images and similar embeddings for similar ones.
Concretely, the algorithm used in SimCLR forces models to produce similar embeddings for an image and its augmented siblings (e.g., image rotated by some degrees),
and different embeddings for separate images (and their augmentations).
The SimCLR framework with large scale models and compute led to state-of-the-art (non-private) ImageNet fine-tuning results at the time of its writing.</p>
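<p>The contrastive objective described above can be sketched in a few lines of NumPy. This is a simplified illustration of a normalized, temperature-scaled cross-entropy loss over paired views, not the SimCLR reference implementation; the function name <code>nt_xent_loss</code>, the temperature value, and the random "embeddings" are assumptions made for the example.</p>

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """Contrastive loss for a batch of paired embeddings.

    z_a[i] and z_b[i] are embeddings of two augmented views of image i.
    Every other embedding in the batch acts as a negative for view i.
    """
    z = np.concatenate([z_a, z_b], axis=0)            # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize rows
    sim = z @ z.T / temperature                       # scaled cosine similarity
    np.fill_diagonal(sim, -np.inf)                    # never compare a view with itself
    n = z_a.shape[0]
    # Index of the positive partner for every row: i pairs with i + n.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Row-wise log-sum-exp, computed stably.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob_pos = sim[np.arange(2 * n), pos] - log_denom
    return -log_prob_pos.mean()

# Hypothetical embeddings: one set of views that agree, one that does not.
rng = np.random.default_rng(0)
views_a = rng.normal(size=(8, 16))
views_b_similar = views_a + 0.01 * rng.normal(size=(8, 16))
views_b_random = rng.normal(size=(8, 16))
loss_aligned = nt_xent_loss(views_a, views_b_similar)
loss_random = nt_xent_loss(views_a, views_b_random)
```

<p>The loss is lower when the two views of each image have similar embeddings than when they are unrelated, which is the behavior pre-training rewards.</p>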
<p><strong>Masked language modeling and autoregressive language modeling for text:</strong>
Masked Language Modeling (MLM) and Auto-regressive Language Modeling (ALM) are two self-supervised pre-training approaches.
While the former asks models to predict deliberately masked out tokens from a piece of text, the latter asks models to simply predict the next token in a sequence.
With large amounts of unlabeled text data, large and expressive Transformer models [<a href="https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html">24</a>], and lots of compute, both approaches produce powerful models that are good starting points for downstream fine-tuning.
For instance, Bidirectional Encoder Representations from Transformers (BERT, [<a href="https://arxiv.org/abs/1810.04805">8</a>]), produced state-of-the-art (non-private) results (at the time) for a large collection of language understanding tasks when fine-tuned on each.</p>
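<p>The difference between the two objectives is easiest to see in how training examples are built from the same token sequence. The sketch below is a toy illustration only: the helper names are hypothetical, the masked positions are chosen by hand, and real MLM pre-training samples them at random and applies further corruption rules.</p>

```python
MASK = "[MASK]"

def autoregressive_examples(tokens):
    """(context, next token) pairs used for autoregressive language modeling."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_example(tokens, masked_positions):
    """Input with the chosen positions hidden, plus the tokens to recover."""
    hidden = set(masked_positions)
    inputs = [MASK if i in hidden else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(hidden)}
    return inputs, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
alm = autoregressive_examples(tokens)
mlm_inputs, mlm_targets = masked_example(tokens, masked_positions=[1, 4])
```

<p>In the autoregressive case the model only ever sees the left context, whereas in the masked case it sees both sides of each hidden token.</p>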
<h2 id="fine-tuning-self-supervised-models-with-dp-optimization">Fine-Tuning Self-Supervised Models With DP-Optimization</h2>
<p>Self-supervised pre-training is appealing in the context of differentially private machine learning.
This is because (i) the mildly curated data needed for pre-training can usually be obtained cheaply from the public domain, and (ii) pre-trained models may contain useful domain knowledge that can reduce the sample complexity of subsequent private learning.
A paradigm for private learning that leverages self-supervised pre-training could follow two steps:</p>
<ul>
<li>collect cheap and public (unlabeled) data from the task domain (e.g., vision, language, etc.) to pre-train a model with self-supervised learning, and</li>
<li>collect moderate amounts of task-specific private (labeled) data and fine-tune the pre-trained model under differential privacy to perform the task.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></li>
</ul>
<p>To date, some of the best differentially private deep learning results in the literature have resulted from instantiating this paradigm [<a href="https://arxiv.org/abs/2011.11660">4</a>, <a href="https://arxiv.org/abs/2110.05679">11</a>, <a href="https://arxiv.org/abs/2110.06500">12</a>].
Below, we review works which capitalize on self-supervised pre-training by differentially privately fine-tuning pre-trained models with an iterative gradient method like DP-SGD [<a href="https://dl.acm.org/doi/abs/10.1145/2976749.2978318">19</a>, <a href="https://ieeexplore.ieee.org/abstract/document/6736861">20</a>].<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>
<img src="/images/fine-tuning-paradigm.png" alt="" /></p>
<p><strong>Private fine-tuning with SimCLR features:</strong>
The authors of [<a href="https://arxiv.org/abs/2011.11660">4</a>] fine-tuned a linear model on top of the embedding vectors produced by SimCLRv2 from the CIFAR-10 dataset. Under a privacy budget of \(\epsilon=2\),
these models reached an average test accuracy of 92.7%. This number can be further improved to ~94% with the use of larger and wider pre-trained models in the SimCLRv2 family.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>
These test accuracies are very close to some standard non-private results attained by an off-the-shelf ResNet architecture [<a href="https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html">6</a>].</p>
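<p>To make the recipe concrete, here is a minimal sketch of DP-SGD for a linear (softmax) classifier trained on fixed feature vectors, which is the shape of the setup described above. It is not the code used in [4]: the function name, the hyper-parameter values, and the synthetic features are assumptions, batches are drawn with a fixed size rather than by Poisson sampling, and no privacy accounting is performed, so the sketch does not by itself certify any particular \(\epsilon\).</p>

```python
import numpy as np

def dp_sgd_linear_head(features, labels, num_classes, *, clip_norm=0.1,
                       noise_multiplier=1.0, lr=1.0, batch_size=256,
                       steps=100, seed=0):
    """Train a softmax classifier on fixed features with clipped, noised gradients."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        x, y = features[idx], one_hot[idx]
        logits = x @ weights
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Per-example cross-entropy gradient: outer(x_i, p_i - y_i).
        per_example = x[:, :, None] * (probs - y)[:, None, :]   # (B, d, C)
        norms = np.sqrt((per_example ** 2).sum(axis=(1, 2)))
        # Clip each example's gradient to norm at most clip_norm.
        scale = np.minimum(1.0, clip_norm / (norms + 1e-12))
        clipped_sum = (per_example * scale[:, None, None]).sum(axis=0)
        # Gaussian noise calibrated to the clipping norm.
        noise = rng.normal(scale=noise_multiplier * clip_norm, size=weights.shape)
        weights -= lr * (clipped_sum + noise) / batch_size
    return weights

# Synthetic stand-in for pre-trained embeddings: three well-separated clusters.
rng = np.random.default_rng(1)
centers = rng.normal(size=(3, 32)) * 3.0
labels = rng.integers(0, 3, size=2000)
features = centers[labels] + rng.normal(size=(2000, 32))
weights = dp_sgd_linear_head(features, labels, num_classes=3)
accuracy = np.mean((features @ weights).argmax(axis=1) == labels)
```

<p>Because the features already separate the classes, the clipped and noised updates are enough to find a good linear head. This is a toy version of the argument that good pre-trained features reduce what the private step has to learn.</p>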
<p><strong>Privately fine-tuning BERT variants and GPT-2:</strong>
The authors of [<a href="https://arxiv.org/abs/2110.05679">11</a>, <a href="https://arxiv.org/abs/2110.06500">12</a>, <a href="http://proceedings.mlr.press/v139/yu21f.html">16</a>] showed that with appropriate hyper-parameters, fine-tuning BERT variants and GPT-2 with DP-optimization results in high-performing private models for text classification and language generation — even on datasets of modest sizes and under modest privacy budgets.
Notably, some of these models attain a task performance close to non-private models from previous years in the literature.
These results also exceed many non-private learning results from the pre-BERT and pre-GPT years.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></p>
<p>More interestingly, the authors showed that the larger (and thus better) the pre-trained model, the better the private fine-tuning performance gets.
This empirical observation in private fine-tuning of large Transformers is qualitatively different from what’s implied by the usual minimax optimal rates derived for vanilla private learning with convex loss functions under approximate differential privacy [<a href="https://ieeexplore.ieee.org/abstract/document/6979031">14</a>, <a href="https://proceedings.neurips.cc/paper/2019/hash/3bd8fdb090f1f5eb66a00c84dbc5ad51-Abstract.html">15</a>].
This discrepancy between experimental results for training large models and the theory for learning with convex losses suggests there is more to be understood.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></p>
<p>Overall, for both vision and language tasks, private learning performance has consistently improved with the improvement in the quality of pre-training,
where the latter is measured by the non-private fine-tuning performance.<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup></p>
<p float="left">
<img src="../images/figure1_classification.png" width="48%" />
<img src="../images/figure1_generation.png" width="48%" />
<caption>Figure 1: Privately fine-tuning better (and larger) pre-trained models leads to consistently improving performance for text classification and language generation.
Left: text classification on MNLI [<a href="https://arxiv.org/abs/1704.05426">25</a>]. Right: language generation on E2E [<a href="https://arxiv.org/abs/1706.09254">26</a>].</caption>
</p>
<h2 id="conclusion-and-outlook">Conclusion and Outlook</h2>
<p>We surveyed recent works in the literature that obtained highly performant private machine learning models leveraging self-supervised pre-training.
Common to these results is the trend that the performance of private learning consistently improved with the quality of public pre-training.
We therefore anticipate that the general paradigm may be useful in additional settings (e.g., federated learning) and tasks (e.g., private synthetic image generation), and lead to better private learning results.</p>
<p>We have thus far assumed that the data for public pre-training can be cheaply obtained.
This, however, does not imply that determining whether a particular source of data is appropriate for public pre-training is an easy problem.
Using publicly available data is not necessarily risk-free in terms of privacy.
For instance, the authors of [<a href="https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting">33</a>] were able to extract personally identifiable information from a GPT-2 model pre-trained on data scraped from the public internet.</p>
<p>Self-supervised pre-training has led to progress in private deep learning, but leveraging pre-trained models alone will not address several fundamental challenges to differentially private learning.
First and foremost, the datasets of machine learning tasks may be sampled from long-tailed distributions [<a href="https://proceedings.neurips.cc/paper/2020/hash/1e14bfe2714193e7af5abc64ecbd6b46-Abstract.html">21</a>].
When privately trained on such datasets, a machine learning model may fail to acquire the learning signal necessary to perform accurate predictions for examples on the tail [<a href="https://dl.acm.org/doi/abs/10.1145/3442188.3445934">28</a>] or from underrepresented (sub)populations [<a href="https://proceedings.neurips.cc/paper/2019/hash/fc0de4e0396fff257ea362983c2dda5a-Abstract.html">29</a>].
Second, many machine learning problems are in a domain where public data (even unlabeled data) may be sparse, e.g., medical imaging.
Developing refined versions of the pre-train-fine-tune approach for problems from these domains is an interesting avenue for future work.</p>
<p>Lastly, differential privacy as one specific definition of privacy may not capture all that’s desired for privacy in reality.
For instance, while differentially private algorithms naturally give machine unlearning guarantees [<a href="https://ieeexplore.ieee.org/abstract/document/9519428">30</a>, <a href="https://ieeexplore.ieee.org/abstract/document/7163042">32</a>], tailored unlearning algorithms tend to have higher capacities of unlearning [<a href="https://proceedings.neurips.cc/paper/2021/hash/9627c45df543c816a3ddf2d8ea686a99-Abstract.html">31</a>].
In addition, what constitutes a record in the differential privacy framework can oftentimes be unclear.
Inappropriately defined example boundaries can create correlated records which cause differential privacy guarantees to degrade [<a href="https://arxiv.org/abs/1603.01508">22</a>].
Moreover, differential privacy guarantees won’t directly prevent the inference of private data outside the original context [<a href="https://heinonline.org/hol-cgi-bin/get_pdf.cgi?handle=hein.journals/washlr79&section=16">23</a>].
These are fundamental limitations of differential privacy which improvements to differentially private learning won’t touch on.</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>The authors thank Nicolas Papernot and Gautam Kamath for detailed feedback and edit suggestions.</p>
<hr />
<h2 id="references">References</h2>
<p>[1] Rahman MA, Rahman T, Laganière R, Mohammed N, Wang Y. Membership Inference Attack against Differentially Private Deep Learning Model. Trans. Data Priv. 2018 Apr 1;11(1):61-79.</p>
<p>[2] Carlini N, Liu C, Erlingsson Ú, Kos J, Song D. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19) 2019 (pp. 267-284).</p>
<p>[3] Guo C, Karrer B, Chaudhuri K, van der Maaten L. Bounding Training Data Reconstruction in Private (Deep) Learning. arXiv preprint arXiv:2201.12383. 2022 Jan 28.</p>
<p>[4] Tramer F, Boneh D. Differentially private learning needs better features (or much more data). arXiv preprint arXiv:2011.11660. 2020 Nov 23.</p>
<p>[5] Yu D, Zhang H, Chen W, Liu TY. Do not let privacy overbill utility: Gradient embedding perturbation for private learning. arXiv preprint arXiv:2102.12677. 2021 Feb 25.</p>
<p>[6] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition 2016 (pp. 770-778).</p>
<p>[7] Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training.</p>
<p>[8] Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018 Oct 11.</p>
<p>[9] Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. In International conference on machine learning 2020 Nov 21 (pp. 1597-1607). PMLR.</p>
<p>[10] Li XL, Liang P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190. 2021 Jan 1.</p>
<p>[11] Li X, Tramer F, Liang P, Hashimoto T. Large language models can be strong differentially private learners. arXiv preprint arXiv:2110.05679. 2021 Oct 12.</p>
<p>[12] Yu D, Naik S, Backurs A, Gopi S, Inan HA, Kamath G, Kulkarni J, Lee YT, Manoel A, Wutschitz L, Yekhanin S. Differentially private fine-tuning of language models. arXiv preprint arXiv:2110.06500. 2021 Oct 13.</p>
<p>[13] Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. 2019 Jul 26.</p>
<p>[14] Bassily R, Smith A, Thakurta A. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science 2014 Oct 18 (pp. 464-473). IEEE.</p>
<p>[15] Bassily R, Feldman V, Talwar K, Guha Thakurta A. Private stochastic convex optimization with optimal rates. Advances in Neural Information Processing Systems. 2019;32.</p>
<p>[16] Yu D, Zhang H, Chen W, Yin J, Liu TY. Large scale private learning via low-rank reparametrization. In International Conference on Machine Learning 2021 Jul 1 (pp. 12208-12218). PMLR.</p>
<p>[17] Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI blog. 2019 Feb 24;1(8):9.</p>
<p>[18] Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, Bernstein MS, Bohg J, Bosselut A, Brunskill E, Brynjolfsson E, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. 2021 Aug 16.</p>
<p>[19] Abadi M, Chu A, Goodfellow I, McMahan HB, Mironov I, Talwar K, Zhang L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security 2016 Oct 24 (pp. 308-318).</p>
<p>[20] Song S, Chaudhuri K, Sarwate AD. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing 2013 Dec 3 (pp. 245-248). IEEE.</p>
<p>[21] Feldman V, Zhang C. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems. 2020;33:2881-91.</p>
<p>[22] Ghosh A, Kleinberg R. Inferential privacy guarantees for differentially private mechanisms. arXiv preprint arXiv:1603.01508. 2016 Mar 4.</p>
<p>[23] Nissenbaum H. Privacy as contextual integrity. Wash. L. Rev. 2004;79:119.</p>
<p>[24] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Advances in neural information processing systems. 2017;30.</p>
<p>[25] Williams A, Nangia N, Bowman SR. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426. 2017 Apr 18.</p>
<p>[26] Novikova J, Dušek O, Rieser V. The E2E dataset: New challenges for end-to-end generation. arXiv preprint arXiv:1706.09254. 2017 Jun 28.</p>
<p>[27] Papernot N, Chien S, Song S, Thakurta A, Erlingsson U. Making the shoe fit: Architectures, initializations, and tuning for learning with privacy.</p>
<p>[28] Suriyakumar VM, Papernot N, Goldenberg A, Ghassemi M. Chasing your long tails: Differentially private prediction in health care settings. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency 2021 Mar 3 (pp. 723-734).</p>
<p>[29] Bagdasaryan E, Poursaeed O, Shmatikov V. Differential privacy has disparate impact on model accuracy. Advances in Neural Information Processing Systems. 2019;32.</p>
<p>[30] Bourtoule L, Chandrasekaran V, Choquette-Choo CA, Jia H, Travers A, Zhang B, Lie D, Papernot N. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP) 2021 May 24 (pp. 141-159). IEEE.</p>
<p>[31] Sekhari A, Acharya J, Kamath G, Suresh AT. Remember what you want to forget: Algorithms for machine unlearning. Advances in Neural Information Processing Systems. 2021 Dec 6;34.</p>
<p>[32] Cao Y, Yang J. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy 2015 May 17 (pp. 463-480). IEEE.</p>
<p>[33] Carlini N, Tramer F, Wallace E, Jagielski M, Herbert-Voss A, Lee K, Roberts A, Brown T, Song D, Erlingsson U, Oprea A. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21) 2021 (pp. 2633-2650).</p>
<p>[34] Nasr M, Song S, Thakurta A, Papernot N, Carlini N. Adversary instantiation: Lower bounds for differentially private machine learning. In 2021 IEEE Symposium on Security and Privacy (SP) 2021 May 24 (pp. 866-882). IEEE.</p>
<p>[35] Jagielski M, Ullman J, Oprea A. Auditing differentially private machine learning: How private is private SGD? Advances in Neural Information Processing Systems. 2020;33:22205-16.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>The authors of [<a href="https://arxiv.org/abs/2108.07258">18</a>] refer to such self-supervised models, which are trained on broad data at scale and can be adapted to a wide range of downstream tasks, as “foundation models.” <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>The idea of privately fine-tuning a publicly pre-trained model certainly isn’t new. One of the first differentially private deep learning papers [<a href="https://arxiv.org/abs/1607.00133">19</a>] included an experiment in which convolutional nets pre-trained on CIFAR-100 were fine-tuned on CIFAR-10. Results on privately fine-tuning <em>self-supervised</em> models are, on the other hand, more recent. Covering these results is our main focus here. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Blue and pink sphere avatars taken from [<a href="https://arxiv.org/abs/2108.07258">18</a>]. Credit to <a href="https://cs.stanford.edu/~dorarad/">Drew A. Hudson</a> for making these. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Unpublished result. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>Hyper-parameters that work well for non-private learning typically aren’t those that work best for differentially private learning [<a href="https://openreview.net/pdf?id=rJg851rYwH">27</a>]. It’s crucial to use a large batch size, a small clipping norm, an appropriate learning rate, and a reasonably large number of training epochs to obtain the mentioned private learning results [<a href="https://arxiv.org/abs/2110.05679">11</a>]. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>In practice, past works have presented mixed results on whether larger models would yield better performance. While some showed that using more filters in a convolutional network can degrade the performance of private learning after some threshold [<a href="https://openreview.net/pdf?id=rJg851rYwH">27</a>], others showed that a larger model can outperform a smaller model from a different model family [<a href="https://arxiv.org/abs/2011.11660">4</a>]. Note these results are conditioned on their particular hyperparameter choices. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>Since the pre-training data for large language models are oftentimes collected through large scale web scraping (e.g., WebText), a common concern is that some training and test instances for downstream tasks may already appear in the pre-training data. Self-supervised pre-training therefore can give models an opportunity to “see” this data even before they are privately fine-tuned. Authors of [<a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">17</a>] confirmed that there is a 1-6% overlap between the test set of many natural language processing tasks and the pre-training data they collected (WebText); these common tasks, however, don’t include those studied by authors of [<a href="https://arxiv.org/abs/2110.05679">11</a>]. The numbers suggest a possibility that existing private fine-tuning results in the literature could be slightly inflated compared to when the pre-training data didn’t contain any instance for any downstream task for which evaluation was performed. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Xuechen Li, Florian Tramèr, Janardhan Kulkarni, Tatsunori Hashimoto
Tue, 15 Mar 2022 11:00:00 -0800
https://differentialprivacy.org/dp-fine-tuning/
A simple recipe for private synthetic data generation
<p>In the <a href="https://differentialprivacy.org/synth-data-0/">last blog post</a>, we covered the potential pitfalls of synthetic data without formal privacy guarantees, and motivated the need for differentially private synthetic data mechanisms. In this blog post, we will describe the <strong>select-measure-generate</strong> paradigm, which is a simple and effective template for designing synthetic data mechanisms. The three steps underlying the select-measure-generate paradigm are illustrated and explained below.</p>
<p><img src="/images/select-measure-reconstruct.png" alt="" /></p>
<ol>
<li><strong>Select</strong> a collection of queries to measure — typically low-dimensional marginals.</li>
<li><strong>Measure</strong> the selected queries privately using a noise-addition mechanism.</li>
<li><strong>Generate</strong> synthetic data that best explains the noisy measurements.<sup id="fnref:0" role="doc-noteref"><a href="#fn:0" class="footnote" rel="footnote">1</a></sup></li>
</ol>
<p>Mechanisms in this class differ primarily in their methodology for selecting queries and their algorithm for generating synthetic data from noisy measurements. The focus of this blog post is the final <strong>Generate</strong> step. Specifically, we will explore different ways in which one can model data distributions for the purpose of generating synthetic data, outlining the qualitative pros and cons of each method. We will then introduce the <strong><a href="https://github.com/ryan112358/private-pgm" target="_blank">Marginal-Based Inference (MBI)</a></strong> repository, which provides methods that, given some set of noisy measurements, enable users to generate synthetic data in a generic and scalable way.</p>
<p>Separating the Generate subroutine from the existing synthetic data generation mechanisms greatly simplifies the design space of new differentially private mechanisms. It allows the mechanism designer to focus on <em>selecting the queries</em> to maximize utility of the synthetic data, rather than <em>how to generate synthetic data</em> that explain the noisy measurements well. Both are challenging technical problems that require different techniques to solve, and MBI provides principled solutions to the latter problem, while exposing an interface that can be readily adopted by mechanism designers.</p>
<h1 id="the-generate-subproblem-a-unifying-view">The Generate Subproblem: A Unifying View</h1>
<p>In this section we will introduce the main optimization problem that underlies several methods for the Generate subproblem, and provide a high-level overview of how each method attempts to solve this optimization problem. Let \( y = \mathcal{M}(D) \) be the noisy measurements obtained from running a privacy mechanism on a discrete dataset \( D \). Our goal is to post-process these noisy measurements to obtain synthetic data that explains them well. In particular, we wish to search over the space of all <em>datasets</em> for one that maximizes the likelihood of the observations \( y \).<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup></p>
<p>\[ \hat{D} \in \text{arg} \max_{D \in \mathcal{D}} \log \mathbb{P}[\mathcal{M}(D) = y] \]</p>
<p>This is a high-dimensional discrete optimization problem, and is generally intractable to solve in practice, even in low-dimensional settings. It is common to consider the relaxed problem that instead optimizes over the set of <em>probability distributions</em> \( \mathcal{S} \):</p>
<p>\[ \hat{P} \in \text{arg} \max_{P \in \mathcal{S}} \log \mathbb{P}[\mathcal{M}(P) = y] \label{eq1} \tag{1} \]</p>
<p>More generally, we can consider any objective function that measures how well \( P \) explains \( y \). The log-likelihood is a natural choice, although other choices are also possible and used in practice. In the special-but-common case where the mechanism is an instance of the Gaussian mechanism, we have \( \mathcal{M}(D) = f(D) + \mathcal{N}(0, \sigma^2)^k \) and \( \log \mathbb{P}[\mathcal{M}(P) = y] \propto - || f(P) - y ||_2^2 \). If \( f \) is a linear function of \( P \), then Problem \ref{eq1} is simply a quadratic program. In the subsequent subsections, we will describe different approaches to solve or approximately solve Problem \ref{eq1}.</p>
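<p>To spell out the Gaussian case: with \( k \) measurements and independent \( \mathcal{N}(0, \sigma^2) \) noise on each, the log-density of the observation \( y \) is</p>
<p>\[ \log \mathbb{P}[\mathcal{M}(P) = y] = -\frac{1}{2\sigma^2} || f(P) - y ||_2^2 - \frac{k}{2} \log(2 \pi \sigma^2), \]</p>
<p>so maximizing the log-likelihood is the same as minimizing the squared error \( || f(P) - y ||_2^2 \). If we further assume \( f(P) = A P \) for some matrix \( A \) (a symbol introduced here for illustration, with \( P \) written as a vector of cell probabilities), Problem \ref{eq1} becomes</p>
<p>\[ \hat{P} \in \text{arg} \min_{P \in \mathcal{S}} || A P - y ||_2^2, \qquad \mathcal{S} = \{ P : P \geq 0, \; \mathbf{1}^\top P = 1 \}, \]</p>
<p>which has a convex quadratic objective and linear constraints.</p>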
<blockquote>
<p><strong>Remark 1</strong>: The distribution learned from solving Problem \ref{eq1} will resemble the true data with respect to the statistics measured by \( \mathcal{M} \). It may or may not accurately preserve other statistics — that is data dependent.</p>
</blockquote>
<blockquote>
<p><strong>Remark 2</strong>: The most common statistics to measure are <strong>low-dimensional marginals</strong>. A marginal for a subset of attributes counts the number of records in the dataset that match each setting of possible values. They are appealing statistics to measure because:</p>
<ul>
<li>They capture low-dimensional structure common in real world data distributions.</li>
<li>Each cell in a marginal is a count, a statistic that is fairly robust to noise.</li>
<li>One individual can only contribute to a single cell of a marginal, so all cells have low sensitivity and can be measured simultaneously with low privacy cost.</li>
</ul>
</blockquote>
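<p>As a small illustration of Remark 2, the sketch below computes a marginal from an integer-encoded dataset and measures it with the Gaussian mechanism. The function names and the synthetic data are assumptions made for the example. Calibrating \( \sigma \) to a target privacy budget is deliberately left out; the example only relies on the fact that adding or removing one record changes exactly one cell by one.</p>

```python
import numpy as np

def marginal(data, domain_sizes, attrs):
    """Count records for each setting of the attributes in `attrs`.

    data: integer array of shape (n_records, n_attributes).
    domain_sizes: number of possible values for each attribute.
    """
    shape = tuple(domain_sizes[a] for a in attrs)
    # Map each record's values on `attrs` to a single flat cell index.
    flat = np.ravel_multi_index(tuple(data[:, a] for a in attrs), shape)
    return np.bincount(flat, minlength=int(np.prod(shape))).reshape(shape)

def noisy_marginal(data, domain_sizes, attrs, sigma, rng):
    """Gaussian mechanism: add independent N(0, sigma^2) noise to every cell."""
    counts = marginal(data, domain_sizes, attrs)
    return counts + rng.normal(scale=sigma, size=counts.shape)

# Hypothetical dataset with three attributes of sizes 2, 3, and 4.
rng = np.random.default_rng(0)
domain_sizes = [2, 3, 4]
data = np.column_stack([rng.integers(0, k, size=1000) for k in domain_sizes])
counts = marginal(data, domain_sizes, attrs=(0, 2))
counts_without_first = marginal(data[1:], domain_sizes, attrs=(0, 2))
noisy = noisy_marginal(data, domain_sizes, attrs=(0, 2), sigma=5.0, rng=rng)
```

<p>Comparing <code>counts</code> with <code>counts_without_first</code> shows the low-sensitivity property directly: dropping one record changes a single cell by one.</p>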
<h3 id="direct">Direct</h3>
<p>We can attempt to solve Problem \ref{eq1} directly by utilizing any algorithm for convex optimization over the probability simplex, such as multiplicative weights. This method works well in low-dimensional regimes, although it quickly becomes intractable for higher-dimensional domains, where it is generally intractable to even enumerate all the entries of a single distribution \( P \), let alone optimize over the space of all distributions.</p>
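<p>For a domain small enough to enumerate, the direct approach fits in a few lines. The sketch below assumes Gaussian noise and linear measurements \( f(P) = A P \), and minimizes \( || A P - y ||_2^2 \) with multiplicative-weights (exponentiated-gradient) updates. The function name, the step size, and the toy measurement matrix are assumptions made for this example and are not taken from the MBI repository.</p>

```python
import numpy as np

def direct_estimate(A, y, iters=5000, step=0.1):
    """Minimize ||A p - y||^2 over the probability simplex.

    Uses multiplicative-weights updates, starting from the uniform distribution.
    """
    p = np.full(A.shape[1], 1.0 / A.shape[1])
    for _ in range(iters):
        grad = 2.0 * A.T @ (A @ p - y)
        p = p * np.exp(-step * grad)   # multiplicative update keeps p positive
        p /= p.sum()                   # renormalize onto the simplex
    return p

# Toy domain: two attributes with 2 and 3 values, so P has 6 cells.
rng = np.random.default_rng(0)
cells = [(a, b) for a in range(2) for b in range(3)]
A = np.zeros((5, len(cells)))
for j, (a, b) in enumerate(cells):
    A[a, j] = 1.0        # one-way marginal on the first attribute
    A[2 + b, j] = 1.0    # one-way marginal on the second attribute
p_true = rng.dirichlet(np.ones(len(cells)))
y = A @ p_true + rng.normal(scale=0.01, size=5)   # noisy normalized marginals
p_hat = direct_estimate(A, y)
```

<p>The vector <code>p</code> has one entry per cell of the full domain, which is exactly what stops this approach from scaling: ten attributes with ten values each would already need \( 10^{10} \) entries.</p>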
<p>Until recently, variants of the direct method were the only general-purpose solutions available for this problem, and as a result, many mechanisms struggled to scale to high-dimensional domains. Recently, several methods have been proposed that attempt to overcome the curse of dimensionality inherent in the direct approach, which scale by imposing additional assumptions on the mechanism \( \mathcal{M} \) and/or by relaxing the optimization problem. A common theme is to restrict attention to a subset of joint distributions which have tractable representations. The sections below describe these more scalable methods, including the different (implicit) assumptions each method makes, as well as the consequences of those assumptions.</p>
<h3 id="probabilistic-graphical-models-pgm">Probabilistic Graphical Models (PGM)</h3>
<p>The first method we describe is <a href="https://arxiv.org/abs/1901.09136" target="\_blank">PGM</a>, which was a key component of the first-place solution in the 2018 NIST Differential Privacy <a href="https://www.nist.gov/ctl/pscr/open-innovation-prize-challenges/past-prize-challenges/2018-differential-privacy-synthetic" target="\_blank">Synthetic Data Competition</a> and in both the first and second-place solutions in the follow-up <a href="https://www.nist.gov/ctl/pscr/open-innovation-prize-challenges/current-and-upcoming-prize-challenges/2020-differential" target="\_blank">Temporal Map Competition</a>.</p>
<p>PGM scales by restricting attention to distributions that can be represented as a graphical model \( P_{\theta} \). The key observation of PGM is that when \( \mathcal{M} \) only depends on \( P \) through its low-dimensional marginals, then one of the optimizers of Problem \ref{eq1} is a graphical model with parameters \( \theta \). In this case, Problem \ref{eq1} is under-determined and typically has infinitely many solutions. It turns out that the solution found by PGM has maximum entropy among all solutions to the problem — a very natural way to break ties among equally good solutions. Remarkably, these facts are true for any dataset — they do not require the underlying data to be generated from a graphical model with the same structure <a href="https://arxiv.org/abs/2108.04978" target="\_blank">[MMS21]</a>.</p>
<p>The parameter vector \( \theta \) is often much smaller than \( P \), and we can efficiently optimize it, bypassing the curse of dimensionality in this special case. The size of \( \theta \) and in turn the complexity of PGM depends on the mechanism \( \mathcal{M} \), and in the worst case is the same as the Direct method.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">3</a></sup> However, in many common cases of practical interest, the complexity of PGM is exponentially better than that of Direct, in which case we can efficiently solve the optimization problem above, finding \( \theta \) and thus a tractable representation of \( \hat{P} \). The complexity ultimately depends on the size of the junction tree derived from the mechanism \( \mathcal{M} \), and understanding this relationship requires some expertise in graphical models. However, if we utilize this understanding to design \( \mathcal{M} \), we can avoid this worst-case behavior, as <a href="https://arxiv.org/abs/2108.04978" target="\_blank">MST</a> and <a href="http://vldb.org/pvldb/vol14/p2190-cai.pdf" target="\_blank">PrivMRF</a> do.</p>
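<p>The gap between the size of \( P \) and the size of \( \theta \) can be seen with simple arithmetic. The sketch below uses made-up attribute sizes and a made-up chain of measured 2-way marginals, and it assumes one parameter per cell of each measured marginal. For a chain, the cliques of the junction tree are just the measured pairs, so this count reflects the favorable case rather than the worst case.</p>

```python
from math import prod

# Hypothetical attribute domain sizes (not taken from any real dataset).
domain = {"age": 100, "workclass": 9, "education": 16, "sex": 2, "income": 2}

# Hypothetical measured marginals that form a chain over the attributes.
cliques = [("age", "workclass"), ("workclass", "education"),
           ("education", "sex"), ("sex", "income")]

# Direct: one entry per cell of the full joint distribution P.
direct_size = prod(domain.values())

# Graphical model: one parameter per cell of each measured marginal
# (an assumption of this sketch; valid here because the cliques form a chain).
pgm_size = sum(prod(domain[a] for a in clique) for clique in cliques)

print("entries in P:", direct_size)
print("entries in theta:", pgm_size)
```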
<h3 id="relaxed-tabular">Relaxed Tabular</h3>
<p>An alternative approach was proposed in the recent <a href="https://arxiv.org/abs/2103.06641" target="\_blank">RAP</a> paper. The key idea is to restrict attention to “pseudo-distributions” that can be represented in a relaxed tabular format. The format is similar to the one-hot encoding of a discrete dataset, although the entries need not be \( 0 \) or \( 1 \), which enables gradient-based optimization to be performed on the cells in this table. The number of rows is a tunable knob that can be set to trade off expressive capacity with computational efficiency. With sufficiently many rows, the true minimizer of the original problem can be expressed in this way, but there is no guarantee that gradient-based optimization will converge to it because this representation introduces non-convexity. Moreover, the search space of this method includes “spurious” distributions, so even the global optimum of the relaxed problem would not necessarily solve the original problem.<sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">4</a></sup> Despite these drawbacks, this method appears to work well in practice.</p>
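<p>The following sketch illustrates the relaxed tabular idea under simplifying assumptions: a 2-way marginal is computed as a sum over rows of products of table entries, which gives exact counts on a genuine one-hot table and remains differentiable when the entries are allowed to take fractional values. The helper names and toy data are hypothetical, and this is not the RAP implementation.</p>

```python
import numpy as np

def one_hot(column, size):
    """One-hot encode an integer-coded attribute as a (rows x size) block."""
    block = np.zeros((len(column), size))
    block[np.arange(len(column)), column] = 1.0
    return block

def table_marginal(block_a, block_b):
    """Cell (i, j) is the sum over rows of block_a[:, i] * block_b[:, j]."""
    return np.einsum("ri,rj->ij", block_a, block_b)

# On a real one-hot table the formula reproduces exact marginal counts.
a = np.array([0, 1, 1, 0, 1])
b = np.array([2, 0, 2, 1, 2])
exact = table_marginal(one_hot(a, 2), one_hot(b, 3))

# In the relaxed table, entries may be any values in [0, 1]; the same formula
# still applies and is differentiable in the entries. The number of rows
# (4 here) is the tunable knob.
rng = np.random.default_rng(0)
relaxed_a = rng.random((4, 2))
relaxed_b = rng.random((4, 3))
relaxed = table_marginal(relaxed_a, relaxed_b)

print(exact)
print(relaxed)
```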
<h3 id="generative-networks">Generative Networks</h3>
<p>Among the iterative methods introduced by <a href="https://arxiv.org/abs/2106.07153" target="\_blank">[LVW21]</a> is GEM (Generative networks with the exponential mechanism), an approach inspired by generative adversarial networks. They propose representing any dataset as a mixture of product distributions over attributes in the data domain. They implicitly encode such distributions using a generative neural network with a softmax layer. In concrete terms, given some Gaussian noise \( \mathbf{z} \sim \mathcal{N}(0, I) \), their <strong>Generate</strong> step outputs \( f_\theta(\mathbf{z}) \) where \( f \) is some feedforward neural network parametrized by \( \theta \). The output \( f_\theta(\mathbf{z}) \) represents a collection of marginal distributions for each individual attribute in the domain, which can be used to directly answer any k-way marginal query. Alternatively, one can sample directly from \( f_\theta(\mathbf{z}) \) if the goal is to generate synthetic tabular data.</p>
<p>Note that the size of \( \mathbf{z} \) can be arbitrarily large, meaning that this generative network approach can theoretically be scaled up to capture any distribution \( P \). Moreover, <a href="https://arxiv.org/abs/2106.07153">[LVW21]</a> show that one can achieve strong performance in practical settings even when \( \mathbf{z} \) is small, allowing such generative network approaches to scale in terms of both computation and memory. However, as is commonly found in deep learning methods, this optimization problem is nonconvex.</p>
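<p>A minimal sketch of this representation is shown below, using plain NumPy with random, untrained weights. The layer sizes, attribute sizes, and function names are hypothetical and are not taken from [LVW21]. Each latent sample yields one product distribution (a softmax per attribute), and a 2-way marginal of the mixture is the average of the outer products across samples.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

attr_sizes = [2, 3, 4]                  # hypothetical attribute domain sizes
latent_dim, hidden, batch = 8, 16, 32   # hypothetical network and batch sizes

# Random (untrained) weights of a small feedforward network f_theta.
W1 = rng.normal(0.0, 0.5, size=(latent_dim, hidden))
W2 = rng.normal(0.0, 0.5, size=(hidden, sum(attr_sizes)))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def generate(z):
    """Map latent noise to one softmax-normalized block per attribute."""
    h = np.maximum(z @ W1, 0.0)                         # ReLU hidden layer
    logits = h @ W2
    split_points = np.cumsum(attr_sizes)[:-1]
    return [softmax(block) for block in np.split(logits, split_points, axis=1)]

def two_way_marginal(blocks, i, j):
    """Marginal of the mixture: average outer product over the latent samples."""
    return np.einsum("ri,rj->ij", blocks[i], blocks[j]) / blocks[i].shape[0]

z = rng.normal(size=(batch, latent_dim))
blocks = generate(z)
m = two_way_marginal(blocks, 0, 2)
print(m.shape, m.sum())
```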
<h3 id="local-consistency">Local Consistency</h3>
<p>Finally, <a href="https://arxiv.org/abs/2106.07153" target="\_blank">GUM</a> and <a href="https://arxiv.org/abs/2109.06153" target="\_blank">APPGM</a> do not search over any space of distributions, but instead impose <em>local consistency</em> constraints on the noisy measurements. These methods relax Problem \ref{eq1} to optimize over the space of pseudo-marginals, rather than distributions. The pseudo-marginals are required to be internally consistent, but there is no guarantee that there is a distribution which realizes those pseudo-marginals. As a result, the solution found by these methods need not be feasible in Problem \ref{eq1}. Nevertheless, we can attempt to generate synthetic data using heuristics to translate these locally consistent pseudo-marginals into synthetic tabular data. This approach was used by team DPSyn in both NIST competitions.</p>
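<p>The gap between local consistency and realizability can be checked on a small example. In the sketch below (toy values chosen for illustration), three binary attributes have pairwise pseudo-marginals that agree on every 1-way marginal, yet no joint distribution produces them. Feasibility is tested with a linear program via <code>scipy.optimize.linprog</code>.</p>

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# Pseudo-marginals over three binary attributes 0, 1, 2 (hypothetical values).
# Pairs (0,1) and (1,2) are perfectly correlated; pair (0,2) is anti-correlated.
same = np.array([[0.5, 0.0], [0.0, 0.5]])
diff = np.array([[0.0, 0.5], [0.5, 0.0]])
pseudo = {(0, 1): same, (1, 2): same, (0, 2): diff}

# Local consistency: every pair implies the same 1-way marginals.
one_way = {}
for (i, j), table in pseudo.items():
    one_way.setdefault(i, []).append(table.sum(axis=1))
    one_way.setdefault(j, []).append(table.sum(axis=0))
locally_consistent = all(
    np.allclose(v, vs[0]) for vs in one_way.values() for v in vs
)

# Global feasibility: look for a joint distribution over the 8 cells whose
# 2-way marginals equal the pseudo-marginals (a pure feasibility problem).
cells = list(itertools.product([0, 1], repeat=3))
A_eq, b_eq = [], []
for (i, j), table in pseudo.items():
    for u, v in itertools.product([0, 1], repeat=2):
        A_eq.append([1.0 if (c[i] == u and c[j] == v) else 0.0 for c in cells])
        b_eq.append(table[u, v])

result = linprog(c=np.zeros(len(cells)), A_eq=np.array(A_eq),
                 b_eq=np.array(b_eq), bounds=[(0, 1)] * len(cells))

print("locally consistent:", locally_consistent)
print("realizable by a joint distribution:", result.success)
```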
<h3 id="summary">Summary</h3>
<p>A qualitative comparison between the discussed methods is given in the table below.<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">5</a></sup></p>
<blockquote>
<p><strong>Remark 3</strong>: Among the alternatives discussed here, only Direct and PGM can be expected to solve Problem \ref{eq1}. The alternatives fail to solve Problem \ref{eq1} in general, either from non-convexity, or from introducing spurious distributions to the search space. This distinguishing feature of PGM comes at a cost: the complexity can be much higher than the alternatives, and in the worst-case, will not be feasible to run. In such cases, one of the approximations must be used instead.</p>
</blockquote>
<table>
<tbody>
<tr>
<td> </td>
<td><strong>Direct</strong></td>
<td><strong>PGM</strong></td>
<td><strong>Relaxed Tabular</strong></td>
<td><strong>Generative Networks</strong></td>
<td><strong>Local Consistency</strong></td>
</tr>
<tr>
<td>Search space includes optimum</td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:green">Yes</span></td>
</tr>
<tr>
<td>Search space excludes spurious distributions</td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:red">No</span></td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:red">No</span></td>
</tr>
<tr>
<td>Convexity preserving</td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:red">No</span></td>
<td><span style="color:red">No</span></td>
<td><span style="color:green">Yes</span></td>
</tr>
<tr>
<td>Solves Problem \ref{eq1}</td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:red">No</span></td>
<td><span style="color:red">No</span></td>
<td><span style="color:red">No</span></td>
</tr>
<tr>
<td>Factors influencing scalability</td>
<td><span style="color:red">Size of Entire Domain</span></td>
<td><span style="color:orange">Size of Junction Tree</span></td>
<td><span style="color:green">Size of Largest Marginal</span></td>
<td><span style="color:green">Size of Largest Marginal</span></td>
<td><span style="color:green">Size of Largest Marginal</span></td>
</tr>
</tbody>
</table>
<h1 id="generating-synthetic-data-with-mbi">Generating Synthetic Data with MBI</h1>
<p>Now that we have introduced the techniques underlying the Generate step, we will show how to utilize the implementations in the MBI repository to develop end-to-end mechanisms for differentially private synthetic data.</p>
<h2 id="preparing-noisy-measurements">Preparing Noisy Measurements</h2>
<p>The input to any method for Generate is a collection of noisy measurements. We show below how to prepare these measurements in a format compatible with the methods for Generate implemented in the MBI repository. The measurements are represented as a list, where each element of the list is a noisy marginal (represented as a numpy array), along with relevant metadata including the attributes in the marginal and the amount of noise used to answer it. In the code snippet below, the selected marginals are hard-coded, but in general this list can be modified to tailor the synthetic data towards a different set of marginals.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">sparse</span>
<span class="kn">from</span> <span class="nn">mbi</span> <span class="kn">import</span> <span class="n">Dataset</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">Dataset</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="s">'adult.csv'</span><span class="p">,</span> <span class="s">'adult-domain.json'</span><span class="p">)</span>
<span class="c1"># SELECT the marginals we'd like to measure
</span><span class="n">marginals</span> <span class="o">=</span> <span class="p">[(</span><span class="s">'marital-status'</span><span class="p">,</span> <span class="s">'sex'</span><span class="p">),</span>
             <span class="p">(</span><span class="s">'education-num'</span><span class="p">,</span> <span class="s">'race'</span><span class="p">),</span>
             <span class="p">(</span><span class="s">'sex'</span><span class="p">,</span> <span class="s">'hours-per-week'</span><span class="p">),</span>
             <span class="p">(</span><span class="s">'workclass'</span><span class="p">,),</span>
             <span class="p">(</span><span class="s">'marital-status'</span><span class="p">,</span> <span class="s">'occupation'</span><span class="p">,</span> <span class="s">'income>50K'</span><span class="p">)]</span>
<span class="c1"># MEASURE the marginals and log the noisy answers
</span><span class="n">sigma</span> <span class="o">=</span> <span class="mi">50</span>
<span class="n">measurements</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">M</span> <span class="ow">in</span> <span class="n">marginals</span><span class="p">:</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">project</span><span class="p">(</span><span class="n">M</span><span class="p">).</span><span class="n">datavector</span><span class="p">()</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">sigma</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
    <span class="n">I</span> <span class="o">=</span> <span class="n">sparse</span><span class="p">.</span><span class="n">eye</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">)</span>
    <span class="n">measurements</span><span class="p">.</span><span class="n">append</span><span class="p">(</span> <span class="p">(</span><span class="n">I</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">sigma</span><span class="p">,</span> <span class="n">M</span><span class="p">)</span> <span class="p">)</span></code></pre></figure>
<p>The above code snippet is a 5-fold composition of Gaussian mechanisms with \( \sigma = 50 \), and hence the entire mechanism is \( \frac{5}{2 \sigma^2} = \frac{1}{1000} \)-zCDP.</p>
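<p>The privacy accounting can be reproduced with a few lines of arithmetic. The sketch below assumes each marginal has \( L_2 \) sensitivity 1 (one individual affects a single cell by 1), and it uses the standard conversion from \( \rho \)-zCDP to \( (\epsilon, \delta) \)-DP, namely \( \epsilon = \rho + 2\sqrt{\rho \ln(1/\delta)} \). The function names and the choice of \( \delta \) are hypothetical.</p>

```python
import math

def gaussian_zcdp(num_measurements, sigma, sensitivity=1.0):
    """zCDP parameter rho for a composition of Gaussian mechanisms."""
    # Each Gaussian mechanism is (sensitivity^2 / (2 sigma^2))-zCDP,
    # and zCDP parameters add up under composition.
    return num_measurements * sensitivity ** 2 / (2.0 * sigma ** 2)

def zcdp_to_approx_dp(rho, delta):
    """rho-zCDP implies (rho + 2 * sqrt(rho * ln(1/delta)), delta)-DP."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

rho = gaussian_zcdp(num_measurements=5, sigma=50)
epsilon = zcdp_to_approx_dp(rho, delta=1e-6)   # delta is a hypothetical choice

print("rho:", rho)
print("epsilon at delta=1e-6:", epsilon)
```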
<h2 id="generating-synthetic-data-from-measurements">Generating Synthetic Data from Measurements</h2>
<p>Given measurements represented in the format above, we can readily generate synthetic data using one of several methods. For example, the code snippet below generates synthetic data that approximately matches the noisy measurements:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">mbi</span> <span class="kn">import</span> <span class="n">FactoredInference</span> <span class="c1"># PGM
</span><span class="kn">from</span> <span class="nn">mbi</span> <span class="kn">import</span> <span class="n">MixtureInference</span> <span class="c1"># Relaxed Tabular + Softmax
</span><span class="kn">from</span> <span class="nn">mbi</span> <span class="kn">import</span> <span class="n">LocalInference</span> <span class="c1"># Local Consistency
</span><span class="kn">from</span> <span class="nn">mbi</span> <span class="kn">import</span> <span class="n">PublicInference</span> <span class="c1"># Not Discussed
</span>
<span class="c1"># GENERATE synthetic data using PGM
</span><span class="n">engine</span> <span class="o">=</span> <span class="n">FactoredInference</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">domain</span><span class="p">,</span> <span class="n">iters</span><span class="o">=</span><span class="mi">2500</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">engine</span><span class="p">.</span><span class="n">estimate</span><span class="p">(</span><span class="n">measurements</span><span class="p">)</span>
<span class="n">synth</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">synthetic_data</span><span class="p">()</span></code></pre></figure>
<p>To generate synthetic data, we simply instantiate one of the imported inference engines. In the code snippet above, we use the FactoredInference engine, which corresponds to the PGM method. The other inference engines share the same interface, and can be used instead if desired.</p>
<blockquote>
<p><strong>Remark 4</strong>: By utilizing the inference engines implemented in MBI, end-to-end synthetic data mechanisms can be written with remarkably little code. This simple example required fewer than 25 lines of code, and <a href="https://github.com/ryan112358/private-pgm/tree/master/mechanisms" target="\_blank">more complex mechanisms</a> can usually be written in a single file with fewer than 200 lines of code. As a result, future research can focus on the measurement selection subproblem, and new ideas can more rapidly be evaluated and iterated on.</p>
</blockquote>
<p>We evaluated the quality of the synthetic data generated by measuring the error of the measured marginals. Interestingly, the synthetic data has lower error than the noisy marginals, with reductions in error up to 30% for the larger marginals, and around 3% for the smaller ones.</p>
<p><img src="/images/smr1.png" alt="" /></p>
<blockquote>
<p><strong>Remark 5:</strong> It is not surprising that the synthetic data enjoys lower error than the noisy marginals. Problem \ref{eq1} can be seen as a <em>projection problem</em>, and there is substantial theoretical <a href="https://arxiv.org/abs/1212.0297" target="\_blank">[NTZ12]</a> and empirical [<a href="https://dl.acm.org/doi/abs/10.1145/2783258.2783366" target="\_blank">LWK15</a>, <a href="https://systems.cs.columbia.edu/private-systems-class/papers/Abowd2019Census.pdf" target="\_blank">AAGK+19</a>] evidence that solving this problem reduces error. Intuitively, the benefit arises due to the inconsistencies in the noisy observations that are resolved through the optimization procedure.</p>
</blockquote>
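<p>The projection intuition can be illustrated with a deliberately simplified stand-in for Problem \ref{eq1}. The sketch below projects a noisy marginal onto the set of non-negative vectors only, whereas solving the full problem also enforces consistency across marginals. Because the true counts lie in this convex set, the projection cannot increase the \( L_2 \) error. The toy counts are hypothetical.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sparse marginal: most cells are empty, a few hold all the records.
true_counts = np.zeros(200)
true_counts[:10] = 500.0

sigma = 50.0
noisy = true_counts + rng.normal(0.0, sigma, size=true_counts.shape)

# Project onto the non-negative vectors, a convex set that contains the truth.
projected = np.maximum(noisy, 0.0)

err_noisy = np.linalg.norm(noisy - true_counts)
err_projected = np.linalg.norm(projected - true_counts)

print("L2 error of noisy marginal:", err_noisy)
print("L2 error after projection:", err_projected)
```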
<p>We can also use the synthetic data to estimate marginals we didn’t measure with the Gaussian mechanism. These estimates may or may not be accurate; it depends on the data and the marginal being estimated. For example, the error on the (sex, income>50K) marginal is around 0.02, while the error on the (education-num, occupation) marginal is about 0.5.</p>
<p><img src="/images/smr2.png" alt="" /></p>
<blockquote>
<p><strong>Remark 6:</strong> The fact that the synthetic data is not accurate for some marginals is not a limitation of the method used for Generate, but rather an artifact of what marginals were selected. Thus, it is clear that selecting the right marginals to measure plays a crucial role in the quality of the synthetic data. This is an important open problem that will be the topic of a future blog post.</p>
</blockquote>
<h1 id="coming-up-next">Coming up Next</h1>
<p>In this blog post, we focused on the <strong>Generate</strong> step of the select-measure-generate paradigm. For the next blog post in this series, we will focus on state-of-the-art approaches to the <strong>Select</strong> sub-problem. If you have any comments, questions, or remarks, please feel free to share them in the comments section below. If you would like to try generating synthetic data with MBI, check out this <a href="https://colab.research.google.com/drive/1c8gT5m_GWfQoa_mx8eXh4sPD48Y0z3ML?usp=sharing">jupyter notebook</a> on Google Colab!</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:0" role="doc-endnote">
<p>The Generate step is a post-processing of already privatized noisy marginals, and therefore the privacy analysis only needs to reason about the first two steps. <a href="#fnref:0" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>Here we assume that \( \mathcal{M} \) is a mechanism with a discrete output space. In practice, this is always the case because any mechanism implemented on a finite computer must have a discrete output space. For continuous output spaces, interpret the objective function as a log density rather than a log probability. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>For example, this worst-case behavior is realized if <strong>all</strong> 2-way marginals are measured. While this can be seen as a limitation of PGM, <a href="http://people.seas.harvard.edu/~salil/research/synthetic-Feb2010.pdf">it is known</a> that generating synthetic data that preserves all 2-way marginals is computationally hard in the worst-case. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:9" role="doc-endnote">
<p>This idea was refined into <a href="https://arxiv.org/abs/2106.07153" target="\_blank">RAP<sup>softmax</sup></a> in follow-up work, which overcomes the latter issue, but does not resolve the non-convexity issue. <a href="#fnref:9" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>These approximations were all developed concurrently, and systematic empirical comparisons between them (and PGM) have not been done to date. Some experimental comparisons can be found in <a href="https://arxiv.org/abs/2106.07153" target="\_blank">[LVW21]</a> and <a href="https://arxiv.org/abs/2109.06153">[MPSM21]</a>. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Ryan McKenna, Terrance Liu
Thu, 27 Jan 2022 09:00:00 -0800
https://differentialprivacy.org/synth-data-1/
What is Synthetic Data?<p>The concept of synthetic data seems to be having “a moment” in the privacy world as a promising approach to sharing data while protecting privacy. Strictly speaking, any time you make up data, you have produced a synthetic dataset, but more specifically</p>
<blockquote>
<p>a <strong>synthetic dataset</strong> is a stand-in for some original dataset that has the same format, and accurately reflects the statistical properties of the original dataset, but contains only “fake” records.</p>
</blockquote>
<p>Intuitively, a synthetic dataset can be used as if it were the real data—we can stare at it, compute summary statistics from it, train models from it, and do everything we normally do with data—but because the records do not correspond to “real” people, we don’t have to worry about protecting privacy.</p>
<p>To be sure, synthetic data is a very appealing concept, but</p>
<blockquote>
<p>simply making data “synthetic” does not guarantee privacy in any meaningful sense of the word,</p>
</blockquote>
<p>and we need to be careful about what it actually means to generate <strong>private synthetic data.</strong></p>
<p>In this post I will briefly describe <strong>differentially private synthetic data</strong>, why we need it, and what we know about producing it. The post will serve as a warmup to some more technical posts that will follow later, describing general approaches and state-of-the-art algorithms for producing synthetic data.</p>
<h3 id="what-is-synthetic-data">What is synthetic data?</h3>
<p>Given a dataset \(X\), a synthetic dataset \(Y\) is a new dataset that has the same <em>structure</em> as \(X\), but whose elements are “fake.” For example, if \(X\) is a highly stylized version of the data collected in the US decennial census, then each element of \(X\) would be a tuple of the form (age, census_block, sex, race, ethnicity) corresponding to the data of a real person who filled out a census card. A synthetic dataset should also contain such tuples, but they no longer need to correspond to real people.</p>
<p>An example of a synthetic dataset for the census would be a dataset consisting of 300,000,000 individuals, all living at one location in rural Mississippi, and all of whom are white, non-Hispanic females aged 81. And therein lies the challenge. Of course we want a synthetic dataset not to merely have the same <em>structure</em> as the original data, but also to preserve the <em>statistical properties</em> of the original data. A synthetic dataset for the census should put roughly the right number of individuals of the right types in the right places to reflect the actual population of the US.</p>
<blockquote>
<p>Thus, to generate synthetic data, we have to know something about the original data, which creates opportunities to leak sensitive information.</p>
</blockquote>
<h4 id="synthetic-data-from-generative-modeling">Synthetic data from generative modeling</h4>
<p>Ensuring that a synthetic dataset really only reflects the statistical properties of the real data, and doesn’t encode sensitive information about the individuals in the real dataset, is quite difficult. I could go on forever about strawman proposals that would generate records that appear to be “fake” but really just encode the original dataset in obvious ways. Just so you believe I could do it, suppose for every real record, we get a fake record by adding 10 years to that individual’s age. But what people typically mean when they talk about generating synthetic data, and what the synthetic data products being sold by companies typically do, is train a generative model for the dataset \(X\) and obtain the synthetic dataset \(Y\) by sampling from that model. The hope is that samples from the generative model look like the original dataset in aggregate, but the actual people in the sample have nothing to do with the real people in the original dataset. For example, a generative model from the US population will lead to more people living in California than in Montana, but ideally wouldn’t simply spit out the data of the actual residents of California.</p>
<h4 id="generative-models-dont-magically-protect-privacy">Generative models don’t magically protect privacy</h4>
<p>Simply training a generative model on \(X\) doesn’t actually mean we’ve hidden anything about \(X\) or the people in it. For an extreme example, suppose the records in \(X\) are \(x_1,\dots,x_n\) and our generative model is arbitrarily expressive and determines that the best model for the data is just the uniform distribution on the set \(x_1,\dots,x_n\). Then if our synthetic dataset consists of \(n\) iid samples from the generative model, our synthetic dataset will simply contain one or more copies of the real data for a large fraction of the people in the original dataset (about 63%, since each record is missed with probability \((1-1/n)^n \approx 1/e\)).</p>
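<p>The fraction of real records that reappear in such a synthetic dataset can be checked directly. The sketch below compares the exact expectation \( 1 - (1 - 1/n)^n \), its limit \( 1 - 1/e \approx 0.63 \), and a quick simulation. The value of \( n \) and the function name are arbitrary choices for illustration.</p>

```python
import math
import random

def fraction_copied(n, seed=0):
    """Draw n iid samples uniformly from n records; return the fraction of records hit."""
    rng = random.Random(seed)
    sampled = {rng.randrange(n) for _ in range(n)}
    return len(sampled) / n

n = 100_000
exact = 1.0 - (1.0 - 1.0 / n) ** n   # expected fraction for this n
limit = 1.0 - 1.0 / math.e           # limit as n grows
simulated = fraction_copied(n)

print("exact expectation:", exact)
print("limit 1 - 1/e:", limit)
print("simulated:", simulated)
```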
<h4 id="reconstruction-attacks-say-hello">Reconstruction attacks say hello</h4>
<p>My example was admittedly a strawman, and when we train a generative model we typically expect it to produce new examples that aren’t in the original training dataset. Maybe by training generative models in the right way, we can ensure that the generative model only captures statistics about the dataset, without revealing information about the individuals in the dataset? Let’s not forget that</p>
<blockquote>
<p>a synthetic dataset is just one of many possible ways to release statistical information about a dataset, and we know from the theory of reconstruction attacks that <strong>any</strong> mechanism that accurately preserves too many statistics inevitably destroys privacy.</p>
</blockquote>
<p>Specifically, there is a rich and fairly comprehensive theory of <strong>reconstruction attacks</strong>, whose <a href="/reconstruction-theory/">theory</a> and <a href="/diffix-attack/">practice</a> we detailed in earlier posts. This theory says that any synthetic dataset that preserves answers to every statistic of the form “How many elements of the dataset satisfy property P?” with any meaningful notion of accuracy, is spectacularly non-private, in the sense that an attacker can reconstruct nearly all of the private information in the original dataset. Viewed through this lens, synthetic data becomes something of a red herring. The important question is which subset of statistics we want to preserve and how to preserve them without compromising privacy.</p>
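<p>To give a feel for why accurate answers to many counting queries are dangerous, here is a minimal sketch in the spirit of a linear reconstruction attack. The dataset size, number of queries, and error bound are hypothetical, and the attack shown (least squares followed by rounding) is only one simple variant, not necessarily the one analyzed in the linked posts.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50                                   # number of people, one secret bit each
secret = rng.integers(0, 2, size=n)

# Each query asks: "how many people in this random subset have secret bit 1?"
num_queries = 8 * n
queries = rng.integers(0, 2, size=(num_queries, n)).astype(float)

# The release answers every query with small bounded error (hypothetical choice).
answers = queries @ secret + rng.uniform(-1.0, 1.0, size=num_queries)

# Attack: solve a least-squares problem, then round each coordinate to a bit.
estimate, *_ = np.linalg.lstsq(queries, answers, rcond=None)
reconstructed = (estimate > 0.5).astype(int)
accuracy = float(np.mean(reconstructed == secret))

print("fraction of secret bits recovered:", accuracy)
```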
<h3 id="differentially-private-synthetic-data">Differentially private synthetic data</h3>
<p>Making a dataset “synthetic” may not be a magic bullet, but it’s still a useful goal. Fortunately, we already have a good working definition of what it means for an algorithm to protect the privacy of the individuals in the dataset—<strong>differential privacy</strong>. So perhaps we can design a differentially private algorithm that takes the original dataset \(X\) and outputs a synthetic dataset \(Y\) that preserves many of the properties of \(X\) accurately (but still within the limits imposed by reconstruction attacks)? In fact, the answer turns out to be a resounding yes!</p>
<blockquote>
<p>There is a differentially private algorithm that takes a dataset \(X\) and outputs a synthetic dataset \(Y\) that preserves an <strong>exponential</strong> number of statistical properties of \(X\) <a href="https://arxiv.org/abs/1109.2229">[BLR08]</a>.</p>
</blockquote>
<p>This statement is intentionally quite informal, but intuitively this result says that we can hope to do amazing things, like generate a synthetic dataset that satisfies the strong guarantee of differential privacy while still approximately preserving impressively complex statistics of the dataset, such as the marginal distribution of every set of three attributes, or the prediction error of every linear classifier.</p>
<p>So what’s the problem? Well, primarily it’s computational complexity. This algorithm, and the many beautiful algorithms it inspired, all have exponential worst-case running time. Specifically, if each element of the dataset contains \(d\) bits, then the worst-case running time is at least as large as \(2^d\). Unfortunately, this turns out to be an inherent bottleneck for any algorithm that generates synthetic data.</p>
<blockquote>
<p>Any private algorithm that takes a dataset \(X\) and outputs a synthetic dataset \(Y\) that preserves even just the correlations between each pair of features must have worst-case running time that grows exponentially in the number of features, under widely believed complexity assumptions. <a href="https://eccc.weizmann.ac.il/report/2010/017/">[UV11]</a></p>
</blockquote>
<p>While there are inherent limits on the accuracy of all differentially private algorithms (see <a href="https://privacytools.seas.harvard.edu/publications/exposed-survey-attacks-private-data">[DSSU17]</a>), and there are significant barriers to making differentially private algorithms practical,</p>
<blockquote>
<p>computational complexity is the main barrier that is specific to differentially private algorithms for generating synthetic data.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
</blockquote>
<h4 id="where-do-we-go-from-here">Where do we go from here?</h4>
<p>Despite the computational bottlenecks, there has still been a lot of amazing progress on differentially private synthetic data. And, despite my criticism of generative modeling in isolation as a means of generating synthetic data, most of these approaches are indeed based on <em>differentially private generative models</em> such as sparse graphical models or GANs, and leverage the fact that these types of models can typically be fit efficiently to realistic data, despite their worst-case hardness. In the next few posts, we will try to describe the landscape of the most promising approaches that we currently have available.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>See <a href="https://arxiv.org/abs/2110.13239">[AACGKLSSTZ21]</a> for an interesting example of a <em>statistical</em> limitation that is specific to differentially private algorithms that generate synthetic data. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Jonathan Ullman
Mon, 10 Jan 2022 15:00:00 -0800
https://differentialprivacy.org/synth-data-0/
Conference Digest - NeurIPS 2021<p>The accepted papers for <a href="https://neurips.cc/Conferences/2020">NeurIPS 2021</a> were recently announced, and there’s a huge amount of differential privacy content.
We found one relevant workshop and 48 papers.
This is up from 31 papers last year, an increase of over 50%!
It looks like there’s huge growth in interest in differentially private machine learning.
Impressively, at the time of this writing, all but five papers are already posted on arXiv!
For the full list of accepted papers, see <a href="https://neurips.cc/Conferences/2021/AcceptedPapersInitial">here</a>.
Please let us know if we missed relevant papers on differential privacy!</p>
<h2 id="workshops">Workshops</h2>
<ul>
<li><a href="https://priml2021.github.io/">Privacy in Machine Learning (PriML) 2021</a></li>
</ul>
<h2 id="papers">Papers</h2>
<ul>
<li>
<p><a href="https://arxiv.org/abs/2103.08721">A Central Limit Theorem for Differentially Private Query Answering</a><br />
Jinshuo Dong, Weijie Su, Linjun Zhang</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2108.02391">Adapting to function difficulty and growth conditions in private optimization</a><br />
Hilal Asi, Daniel Levy, John Duchi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.04378">Adaptive Machine Unlearning</a><br />
Varun Gupta, Christopher Jung, Seth Neel, Aaron Roth, Saeed Sharifi-Malvajerdi, Chris Waites</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2110.13239">An Uncertainty Principle is a Price of Privacy-Preserving Microdata</a><br />
John Abowd, Robert Ashmead, Ryan Cumings-Menon, Simson Garfinkel, Daniel Kifer, Philip Leclerc, William Sexton, Ashley Simpson, Christine Task, Pavel Zhuravlev</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.03408">Antipodes of Label Differential Privacy: PATE and ALIBI</a><br />
Mani Malek Esmaeili, Ilya Mironov, Karthik Prasad, Igor Shilov, Florian Tramer</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.13329">Covariance-Aware Private Mean Estimation Without Private Covariance Estimation</a><br />
Gavin Brown, Marco Gaboardi, Adam Smith, Jonathan Ullman, Lydia Zakynthinou</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.06062">Deep Learning with Label Differential Privacy</a><br />
Badih Ghazi, Noah Golowich, Ravi Kumar, Pasin Manurangsi, Chiyuan Zhang</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.05855">Differential Privacy Dynamics of Langevin Diffusion and Noisy Gradient Descent</a><br />
Rishav Chourasia, Jiayuan Ye, Reza Shokri</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.02674">Differentially Private Empirical Risk Minimization under the Fairness Lens</a><br />
Cuong Tran, My Dinh, Ferdinando Fioretto</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2110.14153">Differentially Private Federated Bayesian Optimization with Distributed Exploration</a><br />
Zhongxiang Dai, Bryan Kian Hsiang Low, Patrick Jaillet</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/1905.03871">Differentially Private Learning with Adaptive Clipping</a><br />
Galen Andrew, Om Thakkar, Swaroop Ramaswamy, Brendan McMahan</p>
</li>
<li>
<p>Differentially Private Model Personalization<br />
Prateek Jain, John Rush, Adam Smith, Shuang Song, Abhradeep Guha Thakurta</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.02900">Differentially Private Multi-Armed Bandits in the Shuffle Model</a><br />
Jay Tenenbaum, Haim Kaplan, Yishay Mansour, Uri Stemmer</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2108.02831">Differentially Private n-gram Extraction</a><br />
Kunho Kim, Sivakanth Gopi, Janardhan Kulkarni, Sergey Yekhanin</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2111.02516">Differential Privacy Over Riemannian Manifolds</a><br />
Matthew Reimherr, Karthik Bharath, Carlos Soto</p>
</li>
<li>
<p>Differentially Private Sampling from Distributions<br />
Sofya Raskhodnikova, Satchit Sivakumar, Adam Smith, Marika Swanberg</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.05585">Differentially Private Stochastic Optimization: New Results in Convex and Non-Convex Settings</a><br />
Raef Bassily, Cristóbal Guzmán, Michael Menart</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2111.01177">Don’t Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence</a><br />
Tianshi Cao, Alex Bie, Arash Vahdat, Sanja Fidler, Karsten Kreis</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2010.09063">Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization</a><br />
Pranav Subramani, Nicholas Vadivelu, Gautam Kamath</p>
</li>
<li>
<p>Exact Privacy Guarantees for Markov Chain Implementations of the Exponential Mechanism with Artificial Atoms<br />
Jeremy Seeman, Matthew Reimherr, Aleksandra Slavković</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.03013">Fast and Memory Efficient Differentially Private-SGD via JL Projections</a><br />
Zhiqi Bu, Sivakanth Gopi, Janardhan Kulkarni, Yin Tat Lee, Hanwen Shen, Uthaipon Tantipongpipat</p>
</li>
<li>
<p>G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of Teacher Discriminators<br />
Yunhui Long, Boxin Wang, Zhuolin Yang, Bhavya Kailkhura, Aston Zhang, Carl Gunter, Bo Li</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.03365">Generalized Linear Bandits with Local Differential Privacy</a><br />
Yuxuan Han, Zhipeng Liang, Yang Wang, Jiheng Zhang</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2008.11193">Individual Privacy Accounting via a Rényi Filter</a><br />
Vitaly Feldman, Tijana Zrnic</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2104.00979">Information-constrained optimization: can adaptive processing of gradients help?</a><br />
Jayadev Acharya, Clement Canonne, Prathamesh Mayekar, Himanshu Tyagi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.00463">Instance-optimal Mean Estimation Under Differential Privacy</a><br />
Ziyue Huang, Yuting Liang, Ke Yi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.07153">Iterative Methods for Private Synthetic Data: Unifying Framework and New Methods</a><br />
Terrance Liu, Giuseppe Vietri, Steven Wu</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.11845">Learning with User-Level Privacy</a><br />
Daniel Levy, Ziteng Sun, Kareem Amin, Satyen Kale, Alex Kulesza, Mehryar Mohri, Ananda Theertha Suresh</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.13513">Littlestone Classes are Privately Online Learnable</a><br />
Noah Golowich, Roi Livni</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2010.07778">Local Differential Privacy for Regret Minimization in Reinforcement Learning</a><br />
Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, Matteo Pirotta</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.03940">Locally differentially private estimation of functionals of discrete distributions</a><br />
Cristina Butucea, Yann Issartel</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2105.10675">Locally private online change point detection</a><br />
Tom Berrett, Yi Yu</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.10870">Multiclass versus Binary Differentially Private PAC Learning</a><br />
Satchit Sivakumar, Mark Bun, Marco Gaboardi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.02848">Numerical Composition of Differential Privacy</a><br />
Sivakanth Gopi, Yin Tat Lee, Lukas Wutschitz</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.11526">On the Sample Complexity of Privately Learning Axis-Aligned Rectangles</a><br />
Menachem Sadigurschi, Uri Stemmer</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.03645">Photonic Differential Privacy with Direct Feedback Alignment</a><br />
Ruben Ohana, Hamlet Medina, Julien Launay, Alessandro Cappelli, Iacopo Poli, Liva Ralaivola, Alain Rakotomamonjy</p>
</li>
<li>
<p>Private and Non-private Uniformity Testing for Ranking Data<br />
Róbert Busa-Fekete, Dimitris Fotakis, Emmanouil Zampetakis</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.07171">Private learning implies quantum stability</a><br />
Yihui Quek, Srinivasan Arunachalam, John A Smolin</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2103.15352">Private Non-smooth ERM and SCO in Subquadratic Steps</a><br />
Janardhan Kulkarni, Yin Tat Lee, Daogao Liu</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.02162">Privately Learning Mixtures of Axis-Aligned Gaussians</a><br />
Ishaq Aden-Ali, Hassan Ashtiani, Christopher Liaw</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.00001">Privately Learning Subspaces</a><br />
Vikrant Singhal, Thomas Steinke</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2111.02281">Privately Publishable Per-instance Privacy</a><br />
Rachel Redberg, Yu-Xiang Wang</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2109.06153">Relaxed Marginal Consistency for Differentially Private Query Answering</a><br />
Ryan McKenna, Siddhant Pradhan, Daniel Sheldon, Gerome Miklau</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2103.03279">Remember What You Want to Forget: Algorithms for Machine Unlearning</a><br />
Ayush Sekhari, Jayadev Acharya, Gautam Kamath, Ananda Theertha Suresh</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.08763">Renyi Differential Privacy of The Subsampled Shuffle Model In Distributed Learning</a><br />
Antonious Girgis, Deepesh Data, Suhas Diggavi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.09159">Robust and differentially private mean estimation</a><br />
Xiyang Liu, Weihao Kong, Sham Kakade, Sewoong Oh</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2110.04995">The Skellam Mechanism for Differentially Private Federated Learning</a><br />
Naman Agarwal, Peter Kairouz, Ken Liu</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2110.11208">User-Level Differentially Private Learning via Correlated Sampling</a><br />
Badih Ghazi, Ravi Kumar, Pasin Manurangsi</p>
</li>
</ul>
Gautam Kamath, Tue, 09 Nov 2021 11:00:00 -0400
https://differentialprivacy.org/neurips2021/
How to deploy machine learning with differential privacy?
<p>In many applications of machine learning, such as machine learning for medical diagnosis, we would like to have machine learning algorithms that do not memorize sensitive information about the training set, such as the specific medical histories of individual patients. Differential privacy is a notion that allows quantifying the degree of privacy protection provided by an algorithm on the underlying (sensitive) data set it operates on. Through the lens of differential privacy, we can design machine learning algorithms that responsibly train models on private data.</p>
<h2 id="why-do-we-need-private-machine-learning-algorithms">Why do we need private machine learning algorithms?</h2>
<p>Machine learning algorithms work by studying a lot of data and updating their parameters to encode the relationships in that data. Ideally, we would like the parameters of these machine learning models to encode general patterns (e.g., “patients who smoke are more likely to have heart disease”) rather than facts about specific training examples (e.g., “Jane Smith has heart disease”). Unfortunately, machine learning algorithms do not learn to ignore these specifics by default. If we want to use machine learning to solve an important task, like making a cancer diagnosis model, then when we publish that machine learning model (for example, by making an open source cancer diagnosis model for doctors all over the world to use) we might also inadvertently reveal information about the training set. A malicious attacker might be able to inspect the published model’s predictions and learn private information about Jane Smith. For instance, the adversary could mount a membership inference attack to learn whether or not Jane Smith contributed her data to the model’s training set [SSS17]. The adversary could also build on membership inference attacks to extract training data by repeatedly guessing possible training points until they result in a sufficiently strong membership signal from the model’s prediction [CTW20]. In many instances, the model itself may be represented by a few of the data samples (e.g., a Support Vector Machine in its dual form).</p>
<p>A common misconception is that if a model generalizes (i.e., performs well on the test examples), then it preserves privacy. As mentioned earlier, this is far from being true. One of the main reasons is that generalization is an average-case behavior of a model (over the distribution of data samples), whereas privacy must be provided for everyone, including outliers (which may deviate from our distributional assumptions).</p>
<p>Over the years, researchers have proposed various approaches towards protecting privacy in learning algorithms (k-anonymity [SS98], l-diversity [MKG07], m-invariance [XT07], t-closeness [LLV07], etc.). Unfortunately, all these approaches are vulnerable to what are called composition attacks [GKS08], which use auxiliary information to violate the privacy protection. Famously, this strategy allowed researchers to de-anonymize part of a movie ratings dataset released to participants of the Netflix Prize when the individuals had also shared their movie ratings publicly on the Internet Movie Database (IMDb) [NS08]. If Jane Smith had assigned the same ratings to movies A, B and C in the Netflix Prize dataset and publicly on IMDb at similar times, then researchers could link data corresponding to Jane across both datasets. This would in turn give them the means to recover ratings that were included in the Netflix Prize but not on IMDb. This example shows how difficult it is to define and guarantee privacy, because it is hard to estimate the scope of knowledge about individuals that is available to adversaries. While the dataset released by Netflix has since been taken down, it is difficult to ensure that all of its copies have been deleted. In recent years, instance-encoding methods like InstaHide [HSL20] and NeuraCrypt [YEO21] have been demonstrated to be vulnerable to such composition attacks as well.</p>
<p>As a result, the research community has converged on differential privacy [DMNS06], which, as opposed to ad-hoc approaches, provides the following semantic guarantee: an adversary learns almost the same information about an individual whether or not that individual is present in the training data set. In particular, it provides a condition on the algorithm, independent of who might be attacking it and of the specifics of the data set instantiation. Put another way, differential privacy is a framework for evaluating the guarantees provided by a system that was designed to protect privacy. Such systems can be applied directly to “raw” data which potentially still contains sensitive information, altogether removing the need for procedures that sanitize or anonymize data and are prone to the failures described previously. That said, minimizing data collection in the first place remains a good practice to limit other forms of privacy risk.</p>
<h2 id="designing-private-machine-learning-algorithms-via-differential-privacy">Designing Private Machine Learning Algorithms via Differential Privacy</h2>
<p>Differential privacy [DMNS06] is a semantic notion of privacy that addresses a lot of the limitations of previous approaches like k-anonymity. The basic idea is to randomize part of the mechanism’s behavior to provide privacy. In our case, the mechanism considered is a learning algorithm, but the differential privacy framework can be applied to study any algorithm.</p>
<p>The intuition for why we introduce randomness into the learning algorithm is that it obscures the contribution of an individual, but does not obscure important statistical patterns. Without randomness, we would be able to ask questions like: “What parameters does the learning algorithm choose when we train it on this specific dataset?” With randomness in the learning algorithm, we instead ask questions like: “What is the probability that the learning algorithm will choose parameters in this set of possible parameters, when we train it on this specific dataset?”</p>
<p>We use a version of differential privacy which requires (in our use case of machine learning) that the probability of learning any particular set of parameters stays roughly the same if we change a single data record in the training set. Changing a record could mean adding a training example, removing a training example, or changing the values within one training example. A data record can be a single training example from an individual, or the collection of all the training examples provided by an individual. The former is often referred to as example-level (or item-level) privacy, and the latter as user-level differential privacy. While user-level privacy provides stronger semantics, it may be harder to achieve. For a more thorough discussion of the taxonomy of these notions, see [DNPR10, JTT18, HR12, HR13]. In this document, for ease of exposition of the technical results, we focus on the example-level notion. The intuition is that if a single patient (Jane Smith) does not affect the outcome of learning much, then that patient’s records cannot be memorized and her privacy is respected. In the rest of this post, how much a single record can affect the outcome of learning is called the sensitivity of an algorithm.</p>
<p>Concretely, consider a data set containing the records of three users. The guarantee of differential privacy is that an adversary is not able to distinguish the answers produced by the randomized algorithm when it is run on the data of two of the three users from the answers it returns when run on the data of all three users. We refer to the degree of indistinguishability as the privacy loss. Smaller privacy loss corresponds to a stronger privacy guarantee.</p>
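<p>For readers who prefer a formula, the indistinguishability requirement is usually written as follows. The symbols \(A\) (the randomized learning algorithm), \(D, D'\) (two data sets differing in a single record), \(S\) (any set of possible outputs), and the parameters \(\epsilon\) and \(\delta\) are standard notation introduced here for illustration; they are not used elsewhere in this post. The algorithm \(A\) is \((\epsilon, \delta)\)-differentially private if, for all such \(D, D'\) and all \(S\),</p>
<p>\[ \Pr[A(D) \in S] \leq e^{\epsilon} \cdot \Pr[A(D') \in S] + \delta. \]</p>
<p>Here \(\epsilon\) plays the role of the privacy loss: the smaller it is, the closer the two output distributions must be. Setting \(\delta = 0\) gives the pure form of the definition.</p>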
<p>It is often thought that privacy is a fundamental bottleneck in obtaining good prediction accuracy/generalization for machine learning algorithms. In fact, recent research has shown that in many instances differential privacy actually helps in designing algorithms with strong generalization ability. Examples where DP has led to better learning algorithms include online linear prediction [KV05] and online PCA [DTTZ13]. Notably, [DFH15] formally showed that generalization for any DP learning algorithm comes for free. More concretely, if a DP learning algorithm has good training accuracy, it is guaranteed to have good test accuracy.
This is true because differential privacy itself acts as a very strong form of regularization.</p>
<p>One might argue that the generalization guarantee which a DP algorithm can achieve may be worse than that of its non-private baselines. For a large class of learning tasks, one can show that asymptotically DP does not introduce any further error beyond the inherent statistical error [SSTT21]. [ACG16, BFTT19] highlight that in the presence of enough data, a DP algorithm can get arbitrarily close to the inherent statistical error, even under strong privacy parameters.</p>
<h2 id="private-empirical-risk-minimization">Private Empirical Risk Minimization</h2>
<p>Before we go into the design of specific differentially private learning algorithms, we first formalize the problem setup, and standardize some notation. Consider a training data set \(D=\{(x_1,y_1),\ldots,(x_n,y_n)\}\) drawn i.i.d. from some fixed (unknown) distribution \(\Pi\), with the feature vector being \(x_i\) and label/response being \(y_i\). We define the training loss at any model \(\theta\) as \(L_{train} (\theta, D) = \frac{1}{n} \sum_{i=1}^{n} l(\theta; (x_i, y_i))\), and the corresponding test loss as \(L_{test} (\theta) = \mathbb{E}_{(x,y) \sim \Pi} \left[ l(\theta;(x,y)) \right]\).
We will design DP algorithms to output models that approximately minimize the test loss while having access only to the training loss.</p>
<p>In the literature, there are a variety of approaches towards designing these DP learning algorithms [CMS11, KST12, BST14, PAE16, BTT18]. One can categorize them broadly as: i) algorithms that assume that the individual loss function \(l(\theta;\cdot) \) is convex in the model parameter to ensure differential privacy, ii) algorithms that are differentially private even when the loss function is non-convex in nature (e.g., deep learning models), and iii) model agnostic algorithms, which do not require any information about the representation of the model \(\theta\) or the loss function \(l(\theta;\cdot) \). In our current discussion, we will focus only on designing algorithms for (ii) and (iii). This is because it turns out that the best known algorithms for (ii) are already competitive with algorithms that are specific to (i) [INS19].</p>
<h2 id="private-algorithms-for-training-deep-learning-models">Private Algorithms for Training Deep Learning Models</h2>
<p>The first approach, due to [SCS13, BST14, ACG16], is named differentially private stochastic gradient descent (DP-SGD). It proposes to modify the model updates computed by the most common optimizer used in deep learning: stochastic gradient descent (SGD). Typically, stochastic gradient descent trains iteratively. At each iteration, a small number of training examples (a “minibatch”) are sampled from the training set. The optimizer computes the average model error on these examples, and then differentiates this average error with respect to each of the model parameters to obtain a gradient vector. Finally, the model parameters (\(\theta_t\)) are updated by subtracting this gradient (\(\nabla_t\)) multiplied by a small constant \(\eta\) (the learning rate, which controls how quickly the optimizer updates the model’s parameters). At a high level, two modifications are made by DP-SGD to obtain differential privacy: first, gradients, which are computed on a per-example basis (rather than averaged over multiple examples), are clipped to control their sensitivity, and, second, spherical Gaussian noise \(b_t\) is added to their sum to obtain the indistinguishability needed for DP. Succinctly, the update step can be written as follows: \(\theta_{t+1} \leftarrow \theta_t - \eta \cdot (\nabla_t + b_t)\).</p>
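<p>The update step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by any of the libraries mentioned in this post: the function names, the <code>noise_multiplier</code> parameter (the noise standard deviation expressed as a multiple of the clipping norm), and the choice to divide the noisy sum by the batch size are all assumptions made here.</p>

```python
import math
import random


def clip_to_norm(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def dp_sgd_step(theta, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One illustrative DP-SGD update (hypothetical helper, not a library API).

    Each per-example gradient is clipped, the clipped gradients are summed,
    spherical Gaussian noise is added to the sum, and the result is averaged.
    """
    dim = len(theta)
    clipped = [clip_to_norm(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    # Assumption: noise standard deviation is noise_multiplier * clip_norm.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_example_grads)
    return [t - lr * n / batch_size for t, n in zip(theta, noisy)]
```

<p>Calling <code>dp_sgd_step(theta, grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.1, rng=random.Random())</code> once per minibatch would give a toy training loop; computing the resulting privacy guarantee requires a separate privacy accountant, which this sketch does not include.</p>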
<p>Let us take the example of a hospital training a model to predict whether patients will be readmitted after being released. To train the model, the hospital uses information from patient records, such as demographic variables and admission variables (e.g., age, ethnicity, insurance type, type of Intensive Care Unit admitted to) but also time-varying vitals and labs (e.g., heart rate, blood pressure, white blood cell counts) [JPS16]. The modifications made by DP-SGD ensure that if (1) Jane Smith’s individual patient record contained unusual features, e.g., her insurance provider was uncommon for people of her age or her heart rate followed an unusual pattern, the resulting signal will have a bounded impact on our model updates, and (2) the model’s final parameters would be essentially identical should Jane Smith have chosen to not contribute (i.e., opt-out) her patient record to the training set. Stronger differential privacy is achieved when one is able to introduce more noise (i.e., sample noise with larger standard deviation) and train for as few iterations as possible.</p>
<p>The two main components that distinguish the above DP-SGD algorithm from traditional SGD are: i) per-example clipping and ii) Gaussian noise addition. In addition, for the analysis to hold, DP-SGD requires that sub-sampling of mini batches is uniform at random from the training data set. While this is a requirement of the privacy analysis rather than of the algorithm itself, in practice many implementations of SGD do not satisfy it and instead process a different permutation of the data at each epoch of training.</p>
<p>While gradient clipping is common in deep learning, where it is often used as a form of regularization, it differs from the clipping in DP-SGD as follows: in standard training, the average gradient over the minibatch is clipped, whereas DP-SGD clips the gradient of each individual example (i.e., \(\nabla l(\theta_t;(x,y)) \)) before averaging. It is an ongoing research direction to understand the effect of per-example clipping in DP-SGD on model training [SSTT21], and to find effective ways to mitigate its impact both in terms of accuracy [PTS21] and training time [ZHS19].</p>
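<p>The difference between the two kinds of clipping is easy to see on a toy minibatch with one very large gradient. The sketch below is only illustrative; the helper names and the example gradients are made up for this purpose.</p>

```python
import math


def l2_clip(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def clip_average(grads, max_norm):
    """Standard practice: average the minibatch gradients, then clip once."""
    dim = len(grads[0])
    avg = [sum(g[j] for g in grads) / len(grads) for j in range(dim)]
    return l2_clip(avg, max_norm)


def average_of_clipped(grads, max_norm):
    """DP-SGD style: clip every example's gradient, then average."""
    dim = len(grads[0])
    clipped = [l2_clip(g, max_norm) for g in grads]
    return [sum(g[j] for g in clipped) / len(clipped) for j in range(dim)]


# Toy minibatch: the first example has a much larger gradient than the second.
toy_grads = [[100.0, 0.0], [0.0, 1.0]]
avg_then_clip = clip_average(toy_grads, 1.0)
clip_then_avg = average_of_clipped(toy_grads, 1.0)
```

<p>With average-then-clip, the large gradient still dominates the direction of the update, so a single example can have an outsized effect. With per-example clipping, each example contributes a vector of norm at most the clipping threshold, which is what bounds the sensitivity.</p>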
<p>In standard stochastic gradient descent, subsampling is usually used either as a way to speed up the training process [CAR16], or as a form of regularization [RCR15]. In DP-SGD, the randomness in the subsampling of the minibatch is used to guarantee DP. The technical component for this sort of privacy analysis is called privacy amplification by subsampling [KLNRS08, BBG18]. Since the sampling randomness is used to guarantee DP, it is crucial that the uniformity in the sampling step is of cryptographic strength. Another (possibly) counterintuitive feature of DP-SGD is that, for the best privacy/utility trade-off, it is in general better to have larger batch sizes. In fact, full-batch DP-gradient descent may provide the best privacy/utility trade-offs, albeit at the expense of computational feasibility.</p>
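<p>As a rough sketch of what random minibatch selection might look like, the snippet below includes each training example independently with a fixed probability (often called Poisson sampling), and draws its randomness from the operating system through Python’s <code>random.SystemRandom</code>. Both the sampling scheme and the function name are assumptions chosen for illustration; the post itself does not prescribe a particular sampler.</p>

```python
import random


def sample_minibatch_indices(num_examples, sampling_rate, rng=None):
    """Return the indices of a randomly selected minibatch.

    Each index is included independently with probability sampling_rate, so
    the batch size varies from step to step. By default the randomness comes
    from random.SystemRandom, which reads from the OS entropy source.
    """
    if rng is None:
        rng = random.SystemRandom()
    return [i for i in range(num_examples) if rng.random() < sampling_rate]
```

<p>By contrast, shuffling the data once per epoch and walking through it in fixed-size chunks is not the same sampling process, which is why an analysis that assumes random subsampling does not automatically apply to it.</p>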
<p>For a fixed DP guarantee, the magnitude of the Gaussian noise that gets added to the gradient updates in each step of DP-SGD is proportional to \(\sqrt{T}\), where \(T\) is the number of steps the model is trained for. As a result, it is important to tune the number of training steps for the best privacy/utility trade-off.</p>
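<p>One way to see the square-root dependence is through zero-concentrated differential privacy (zCDP) accounting, which this post does not otherwise discuss. Under that accounting, a Gaussian mechanism with L2 sensitivity \(C\) and noise standard deviation \(\sigma\) costs \(\rho = C^2 / (2\sigma^2)\), and costs add up across steps. The sketch below solves for \(\sigma\) given a total budget; it ignores privacy amplification by subsampling, and the function name and parameters are hypothetical.</p>

```python
import math


def noise_std_for_steps(num_steps, total_rho, clip_norm=1.0):
    """Per-step Gaussian noise std needed to keep a total zCDP budget.

    Assumes each step costs clip_norm**2 / (2 * sigma**2) and that the
    per-step costs add up linearly over num_steps (no subsampling
    amplification is taken into account).
    """
    return clip_norm * math.sqrt(num_steps / (2.0 * total_rho))
```

<p>Under these assumptions, training for four times as many steps requires twice as much noise per step to keep the same overall guarantee.</p>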
<p>In the <a href="https://github.com/tensorflow/privacy/blob/master/tutorials/mnist_dpsgd_tutorial.py">following tutorial</a>, we provide a small code snippet to train a model with DP-SGD.</p>
<h2 id="model-agnostic-private-learning">Model Agnostic Private Learning</h2>
<p>The Sample and Aggregate framework [NRS07] is a generic method to add differential privacy to a non-private algorithm without caring about its internal workings, i.e., it is model agnostic. In the context of machine learning, one can state the main idea as follows: Consider a multi-class classification problem. Take the training data, and split it into k disjoint subsets of equal size. Train independent models \(\theta_1, \theta_2, \ldots, \theta_k \) on the disjoint subsets. In order to predict on a test example x, first compute a private histogram over the set of k predictions \(\theta_1(x), \theta_2(x), \ldots, \theta_k(x) \). Then, select and output the bin in the histogram with the highest count, after adding a small amount of Laplace/Gaussian noise to the counts. In the context of DP learning, this particular approach was used in two different lines of work: i) PATE [PAE16], and ii) model agnostic private learning [BTT18]. While the latter focused on obtaining theoretical privacy/utility trade-offs for a class of learning tasks (e.g., agnostic PAC learning), the PATE approach focuses on practical deployment. Both these lines of work make one common observation: if the predictions \(\theta_1(x), \theta_2(x), \ldots, \theta_k(x) \) are fairly consistent, then the privacy cost in terms of DP is very small. Hence, one can run a large number of prediction queries without violating DP constraints. In the following, we describe the PATE approach in detail.</p>
<p>The private aggregation of teacher ensembles (PATE) demonstrated in particular that this approach allows one to learn deep neural networks with differential privacy. It proposes to have an ensemble of models trained without privacy predict with differential privacy by having these models predict in aggregate rather than revealing their individual predictions. In PATE, we start by partitioning the private dataset into smaller subsets of data. These subsets are partitions, so there is no overlap between the data included in any pair of partitions. If Jane Smith’s record was in our private dataset, then it is included in one of the partitions only. That is, only one of the teachers has analyzed Jane Smith’s record during training. We train an ML model, called a teacher, on each of these partitions. We now have an ensemble of teacher models that were trained independently, but without any guarantees of privacy. How do we use this ensemble to make predictions that respect privacy? In PATE, we add noise while aggregating the predictions made individually by each teacher to form a single common prediction. We count the number of teachers who voted for each class, and then perturb that count by adding random noise sampled from the Laplace or Gaussian distribution. Each label predicted by the noisy aggregation mechanism comes with rigorous differential privacy guarantees that bound the privacy budget spent to label that input. Again, stronger differential privacy is achieved when we are able to introduce more noise in the aggregation and are able to answer as few queries as possible. Let us now come back to our running example. Imagine that we’d like to use the output of PATE to know if Jane likes a particular movie.
Only one teacher, the one trained on the partition containing Jane Smith’s data, has learned that a record similar to Jane’s is characteristic of an individual who likes similar movies, and as a consequence changes its prediction on a test input which is similar to Jane’s to predict the movie rating assigned by Jane. However, because the teacher only contributes a single vote to the aggregation, and because the aggregation injects noise, we won’t be able to know whether the teacher changed its prediction to the movie rating assigned by Jane because the teacher indeed trained on Jane’s data or because the noise injected during the aggregation “flipped” that teacher’s vote. The random noise added to vote counts prevents the outcome of aggregation from reflecting the votes of any individual teacher, which protects privacy.</p>
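<p>The noisy aggregation step can be sketched as follows. This is a simplified illustration using Laplace noise on the vote counts; it is not the PATE implementation available in TensorFlow Privacy, it does not track the privacy budget, and the function names and the <code>noise_scale</code> parameter are assumptions made for this example.</p>

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_max_label(teacher_votes, num_classes, noise_scale, rng):
    """Return the class with the highest noisy vote count.

    teacher_votes holds one predicted class index per teacher. A noise_scale
    of 0 disables the noise, which is useful only for debugging.
    """
    counts = [0] * num_classes
    for vote in teacher_votes:
        counts[vote] += 1
    if noise_scale > 0:
        noisy = [c + laplace_noise(noise_scale, rng) for c in counts]
    else:
        noisy = list(counts)
    return max(range(num_classes), key=lambda c: noisy[c])
```

<p>When the teachers agree strongly, the noise is very unlikely to change the winning label, which matches the observation that consistent predictions are cheap in terms of privacy. When the vote is close, a single teacher’s vote and the noise become hard to tell apart.</p>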
<h2 id="practically-deploying-differential-privacy-in-machine-learning">Practically deploying differential privacy in machine learning</h2>
<p>The two approaches we introduced have the advantage of being conceptually simple to understand. Fortunately, there also exist several open-source implementations of these approaches. For instance, DP-SGD is implemented in TensorFlow Privacy, Objax, and Opacus. This means that one is able to take an existing TensorFlow, JAX, or PyTorch pipeline for training a machine learning model and replace a non-private optimizer with DP-SGD. An example implementation of PATE is also available in TensorFlow Privacy. So what are the concrete potential obstacles to deploying machine learning with differential privacy?</p>
<p>The first obstacle is the accuracy of privacy-preserving models. Datasets are often sampled from distributions with heavy tails. For instance, in a medical application, there are typically (and fortunately) fewer patients with a given medical condition than patients without that condition. This means that there are fewer training examples for patients with each medical condition to learn from. Because differential privacy prevents us from learning patterns which are not found generally across the training data, it limits our ability to learn from these patients, for whom we have very few examples [SPG]. More generally, there is often a trade-off between the accuracy of a model and the strength of the differential privacy guarantee it was trained with: the smaller the privacy budget is, the larger the impact on accuracy typically is. That said, this tension is not always inevitable and there are instances where privacy and accuracy are synergistic because differential privacy implies generalization [DFH15] (but not vice versa).</p>
<p>The second obstacle to deploying differentially private machine learning can be the computational overhead. For instance, in DP-SGD one must compute per-example gradients rather than average gradients. This often means that optimizations implemented in machine learning frameworks to exploit matrix algebra supported by underlying hardware accelerators (e.g., GPUs) are harder to take advantage of. In another example, PATE requires that one train multiple models (the teachers) rather than a single model so this can also introduce overhead in the training procedure. Fortunately, this cost is mostly mitigated in recent implementations of private learning algorithms, in particular in Objax and Opacus.</p>
<p>The third obstacle to deploying differential privacy, in machine learning but more generally in any form of data analysis, is the choice of privacy budget. The smaller the budget, the stronger the guarantee is. This means one can compare two analyses and say which one is “more private”. However, this also means that it is unclear what is “small enough” of a privacy budget. This is particularly problematic given that applications of differential privacy to machine learning often require a privacy budget that provides only a weak theoretical guarantee in order to train a model whose accuracy is high enough to warrant a useful deployment. Thus, it may be interesting for practitioners to evaluate the privacy of their machine learning algorithm by attacking it themselves. Whereas the theoretical analysis of an algorithm’s differential privacy guarantees provides a worst-case guarantee limiting how much private information the algorithm can leak against any adversary, implementing a specific attack can be useful to know how successful a particular adversary or class of adversaries would be. This helps interpret the theoretical guarantee but should not be treated as a direct substitute for it. Open-source implementations of such attacks are increasingly available: e.g., for membership inference <a href="https://github.com/tensorflow/privacy/tree/master/tensorflow_privacy/privacy/privacy_tests/membership_inference_attack">here</a> and <a href="https://github.com/cchoquette/membership-inference">here</a>.</p>
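<p>To give a flavor of what such an empirical evaluation might look like, the sketch below implements a very simple loss-threshold membership test: examples on which the model has low loss are guessed to be training members. It then compares the attack’s true and false positive rates against the inequality \(TPR \leq e^{\epsilon} \cdot FPR + \delta\), which any membership test against an \((\epsilon, \delta)\)-DP algorithm must satisfy in expectation. The function names and toy inputs are hypothetical, and this is far simpler than the attacks in the repositories linked above.</p>

```python
import math


def threshold_attack_rates(member_losses, nonmember_losses, threshold):
    """Guess 'member' when the loss is below threshold; return (TPR, FPR)."""
    tpr = sum(loss < threshold for loss in member_losses) / len(member_losses)
    fpr = sum(loss < threshold for loss in nonmember_losses) / len(nonmember_losses)
    return tpr, fpr


def within_dp_bound(tpr, fpr, epsilon, delta=0.0):
    """Check the empirical rates against TPR <= exp(epsilon) * FPR + delta.

    Empirical rates are noisy estimates, so a small violation on few
    samples is not by itself evidence of a broken guarantee.
    """
    return tpr <= math.exp(epsilon) * fpr + delta
```

<p>An attack whose measured rates sit far below the bound suggests that this particular adversary gains much less than the worst case allows, which is the kind of interpretation aid described above.</p>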
<h2 id="conclusion">Conclusion</h2>
<p>In the above, we discussed some of the algorithmic approaches towards differentially private model training which have been effective in both theoretical and practical settings. Since this is a rapidly growing field, we could not cover all the important aspects of the research space. Some prominent ones include: (i) Choice of the best hyperparameters in the training of DP models: in order to ensure that the overall algorithm preserves differential privacy, one needs to ensure that the choice of hyperparameters itself preserves DP. Recent research has provided algorithms for selecting the best hyperparameters in a differentially private fashion [LT19]. (ii) Choice of network architecture: it is not always true that the best known model architectures for non-private model training are indeed the best for training with differential privacy. In particular, we know that the number of model parameters may have adverse effects on the privacy/utility trade-offs [BST14]. Hence, choosing the right model architecture is important for providing a good privacy/utility trade-off [PTS21]. (iii) Training in the federated/distributed setting: in the above exposition, we assumed that the training data lies in a single centralized location. However, in settings like Federated Learning (FL) [MMRHA17], the data records can be highly distributed, e.g., across various mobile devices. Running DP-SGD in the FL setting, which is required for FL to provide privacy guarantees for the training data, raises a series of challenges [KMA19] which are often addressed by distributed private learning algorithms designed specifically for FL settings [BKMTT20, KMSTTZ21]. Some of the specific challenges in the context of FL include limited and non-uniform availability of clients (holding individual data records) and unknown (and variable) size of the training data [BKMTT18]. On the other hand, PATE-style algorithms lend themselves naturally to the distributed setting once combined with existing cryptographic primitives, as demonstrated by the CaPC protocol [CDD21]. Addressing these challenges is an active area of research.</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>The authors would like to thank Thomas Steinke and Andreas Terzis for detailed feedback and edit suggestions. Parts of this blog post previously appeared on <a href="http://www.cleverhans.io">www.cleverhans.io</a>.</p>
<h2 id="citations">Citations</h2>
<p>[ACG16] Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016, October). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (pp. 308-318). ACM.</p>
<p>[BBG18] Balle, B., Barthe, G., & Gaboardi, M. (2018). Privacy amplification by subsampling: Tight analyses via couplings and divergences. arXiv preprint arXiv:1807.01647.</p>
<p>[BKMTT18] Balle, B., Kairouz P., McMahan M., Thakkar O. & Thakurta A. (2020). Privacy amplification via random check-ins. In NeurIPS.</p>
<p>[MMRHA17] McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017, April). Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics (pp. 1273-1282). PMLR.</p>
<p>[KMSTTZ18] Kairouz P., McMahan M., Song S., Thakkar O., Thakurta A., & Xu Z. (2021). Practical and Private (Deep) Learning without Sampling or Shuffling. In ICML.</p>
<p>[BFTT19] Bassily, R., Feldman, V., Talwar, K., & Thakurta, A. Private Stochastic Convex Optimization with Optimal Rates. In NeurIPS 2019.</p>
<p>[BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proceedings of the 55th Annual IEEE Symposium on Foundations of Computer Science.</p>
<p>[BTT18] Bassily, R., Thakurta, A. G., & Thakkar, O. D. (2018). Model-agnostic private learning. Advances in Neural Information Processing Systems.</p>
<p>[CDD21] Choquette-Choo, C. A., Dullerud, N., Dziedzic, A., Zhang, Y., Jha, S., Papernot, N., & Wang, X. (2021). CaPC Learning: Confidential and Private Collaborative Learning. arXiv preprint arXiv:2102.05188.</p>
<p>[CMS11] Chaudhuri, K., Monteleoni, C., & Sarwate, A. D. (2011). Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3).</p>
<p>[CTW20] Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., … & Raffel, C. (2020). Extracting training data from large language models. arXiv preprint arXiv:2012.07805.</p>
<p>[DFH15] Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. (2015). Generalization in adaptive data analysis and holdout reuse. arXiv preprint arXiv:1506.02629.</p>
<p>[DMNS06] Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006, March). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (pp. 265-284). Springer, Berlin, Heidelberg.</p>
<p>[DNPR10] Dwork, C., Naor, M., Pitassi, T., & Rothblum, G. N. (2010, June). Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing (pp. 715-724).</p>
<p>[DTTZ14] Dwork, C., Talwar, K., Thakurta, A., & Zhang, L. (2014, May). Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (pp. 11-20).</p>
<p>[HSL20] Huang, Y., Song, Z., Li, K., & Arora, S. (2020, November). Instahide: Instance-hiding schemes for private distributed learning. In International Conference on Machine Learning (pp. 4507-4518). PMLR.</p>
<p>[HR12] Hardt, M., & Roth, A. (2012, May). Beating randomized response on incoherent matrices. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing (pp. 1255-1268).</p>
<p>[HR13] Hardt, M., & Roth, A. (2013, June). Beyond worst-case analysis in private singular vector computation. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing (pp. 331-340).</p>
<p>[JPS16] Johnson, A., Pollard, T., Shen, L. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016). https://doi.org/10.1038/sdata.2016.35</p>
<p>[JTT18] Jain, P., Thakkar, O. D., & Thakurta, A. (2018, July). Differentially private matrix completion revisited. In International Conference on Machine Learning (pp. 2215-2224). PMLR.</p>
<p>[INS19] Iyengar, R., Near, J. P., Song, D., Thakkar, O., Thakurta, A., & Wang, L. (2019, May). Towards practical differentially private convex optimization. In 2019 IEEE Symposium on Security and Privacy (SP) (pp. 299-316). IEEE.</p>
<p>[KST12] Kifer, D., Smith, A., & Thakurta, A. (2012, June). Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory (pp. 25-1). JMLR Workshop and Conference Proceedings.</p>
<p>[KMA19] Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., … & Zhao, S. (2019). Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.</p>
<p>[KV05] Kalai, Adam, and Santosh Vempala. “Efficient algorithms for online decision problems.” Journal of Computer and System Sciences 71.3 (2005): 291-307.</p>
<p>[KLNRS08] Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., & Smith, A. (2008). What can we learn privately? In Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science (pp. 531-540).</p>
<p>[LLV07] Li, N., Li, T., & Venkatasubramanian, S. (2007, April). t-closeness: Privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering (pp. 106-115). IEEE.</p>
<p>[LT19] Liu, J., & Talwar, K. (2019, June). Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (pp. 298-309).</p>
<p>[M17] Mironov, I. (2017, August). Renyi differential privacy. In Computer Security Foundations Symposium (CSF), 2017 IEEE 30th (pp. 263-275). IEEE.</p>
<p>[MKG07] Machanavajjhala, Ashwin; Kifer, Daniel; Gehrke, Johannes; Venkitasubramaniam, Muthuramakrishnan (March 2007). “L-diversity: Privacy Beyond K-anonymity”. ACM Transactions on Knowledge Discovery from Data.</p>
<p>[NRS07] Nissim, K., Raskhodnikova, S., & Smith, A. (2007, June). Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing (pp. 75-84).</p>
<p>[NS08] Narayanan, A., & Shmatikov, V. (2008, May). Robust de-anonymization of large sparse datasets. In Security and Privacy, 2008. SP 2008. IEEE Symposium on (pp. 111-125). IEEE.</p>
<p>[PAE16] Papernot, N., Abadi, M., Erlingsson, U., Goodfellow, I., & Talwar, K. (2016). Semi-supervised knowledge transfer for deep learning from private training data. ICLR 2017.</p>
<p>[PTS21] Papernot, N., Thakurta, A., Song, S., Chien, S., & Erlingsson, U. (2020). Tempered sigmoid activations for deep learning with differential privacy. AAAI 2021.</p>
<p>[RCR15] Rudi, A., Camoriano, R., & Rosasco, L. (2015, December). Less is More: Nyström Computational Regularization. In NIPS (pp. 1657-1665).</p>
<p>[SCS13] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In Proceedings of the 2013 IEEE Global Conference on Signal and Information Processing, GlobalSIP ’13, pages 245–248, Washington, DC, USA, 2013. IEEE Computer Society.</p>
<p>[SPG] Chasing Your Long Tails: Differentially Private Prediction in Health Care Settings. Vinith Suriyakumar, Nicolas Papernot, Anna Goldenberg, Marzyeh Ghassemi. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency.</p>
<p>[SS98] Samarati, Pierangela; Sweeney, Latanya (1998). “Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression” (PDF). Harvard Data Privacy Lab. Retrieved April 12, 2017</p>
<p>[SSS17] Shokri, R., Stronati, M., Song, C., & Shmatikov, V. (2017, May). Membership inference attacks against machine learning models. In Security and Privacy (SP), 2017 IEEE Symposium on (pp. 3-18). IEEE.</p>
<p>[SSTT21] Song, S., Thakkar, O., & Thakurta, A. (2020). Evading the Curse of Dimensionality in Unconstrained Private GLMs. In AISTATS 2021.</p>
<p>[XT07] Xiao X, Tao Y (2007) M-invariance: towards privacy preserving re-publication of dynamic datasets. In: SIGMOD conference, Beijing, China, pp 689–700</p>
<p>[YEO21] Yala, A., Esfahanizadeh, H., Oliveira, R. G. D., Duffy, K. R., Ghobadi, M., Jaakkola, T. S., … & Medard, M. (2021). NeuraCrypt: Hiding Private Health Data via Random Neural Networks for Public Training. arXiv preprint arXiv:2106.02484.</p>
<p>[ZHS19] Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2019.</p>
Nicolas Papernot, Abhradeep Thakurta. Mon, 25 Oct 2021 11:00:00 -0700
https://differentialprivacy.org/how-to-deploy-ml-with-dp/
One-shot DP Top-k mechanisms
<p>In the last <a href="https://differentialprivacy.org/exponential-mechanism-bounded-range/"><em>blog post</em></a>, we showed that the exponential mechanism enjoys improved composition bounds over general pure DP mechanisms due to a property called <strong>bounded range</strong>. For this post, we will present another useful, and somewhat surprising, property of the exponential mechanism in its application of top-\(k\) selection.</p>
<h2 id="differentially-private-top-k-selection">Differentially Private Top-\(k\) Selection</h2>
<p>We will focus on datasets that are a vector of counts \(h = (h_1, \cdots, h_d) \in \mathbb{N}^d\), which consist of counts \(h_i\) for elements from a universe \(\mathcal{U}\) where \(|\mathcal{U}| = d\). Let’s assume that a user’s data can modify each count by at most 1, yet can change all \(d\) counts, i.e. the \(\ell_\infty\)-sensitivity is 1 and the \(\ell_0\)-sensitivity is \(d\). The task here is to return the top-\(k\) elements from the input counts in a differentially private way.</p>
<p>For top-\(1\), this is simply returning the element with the max count, and this is precisely the problem that the exponential mechanism is set up to solve. Let’s write out the exponential mechanism \(M^{(1)}: \mathbb{N}^d \to [d]\) for this instance:
\[
\mathbb{P}[M^{(1)}(h) = i] = \frac{e^{ \varepsilon h_i }}{\sum_{j \in [d] } e^{ \varepsilon h_j } }, \qquad \forall i \in [d].
\]
For those wondering why this formula omits the factor of \(1/2\) in the exponent, we are using the <a href="https://dongjs.github.io/2020/02/10/ExpMech.html">stronger result</a> of the exponential mechanism which replaces global sensitivity with the range of the loss function \(\ell(i,h) = - h_i\), which is \(1\) in this case. Recall from the last blog post that the exponential mechanism is \(\varepsilon^2/8\)-CDP.</p>
<p>Hence, to generalize this to top-\(k\) selection, we can simply iteratively apply this exponential mechanism by removing the <em>discovered</em> element from each previous round. That is, we write \(M^{(k)}: \mathbb{N}^d \to [d]^k\) as the following for any outcome \( (i_1, i_2, \cdots, i_k) \in [d]^k\),</p>
<p>\[
\mathbb{P}[M^{(k)}(h) = (i_1, i_2, \cdots, i_k)] \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
\]
<a name="eq:peelingEM"></a>
\[\qquad = \frac{e^{ \varepsilon h_{i_1} }}{\sum_{j \in [d] } e^{ \varepsilon h_j } } \cdot \frac{ e^{\varepsilon h_{i_2} } }{\sum_{j\in [d]\setminus \{ i_1\}} e^{\varepsilon h_j } } \cdot \cdots \cdot \frac{ e^{ \varepsilon h_{i_k} } }{\sum_{j\in [d]\setminus \{ i_1, \cdots, i_{k-1}\}} e^{ \varepsilon h_j } }.
\tag{1}
\]</p>
<p>We can then apply composition to conclude that \(M^{(k)}\) is \(k \varepsilon^2/8\)-CDP.</p>
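<p>Read as a procedure, Equation (1) says: sample one element with the exponential mechanism, remove it, and repeat on what remains. A minimal NumPy sketch of that sampler is given below. The function name <code>peeling_top_k</code> is an assumption made for illustration, and indices are 0-based rather than ranging over \([d]\).</p>

```python
import numpy as np

def peeling_top_k(h, eps, k, rng):
    """Iteratively apply the exponential mechanism, removing each selected element."""
    h = np.asarray(h, dtype=float)
    available = list(range(len(h)))
    chosen = []
    for _ in range(k):
        logits = eps * h[available]
        weights = np.exp(logits - logits.max())   # shift by the max for numerical stability
        probs = weights / weights.sum()
        pick = rng.choice(len(available), p=probs)
        chosen.append(available.pop(pick))        # the element is not eligible in later rounds
    return tuple(chosen)
```

<p>Each of the \(k\) rounds is one exponential mechanism, which is why composition gives the \(k \varepsilon^2/8\)-CDP bound stated above.</p>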
<h2 id="gumbel-noise-and-the-exponential-mechanism">Gumbel Noise and the Exponential Mechanism</h2>
<p>As we discussed in our last post, we can implement the exponential mechanism by adding <a href="https://en.wikipedia.org/wiki/Gumbel_distribution">Gumbel</a> noise to each count and reporting the noisy max element. A Gumbel random variable \(X \sim \text{Gumbel}(\beta) \), parameterized by scale parameter \(\beta>0\), has the following density function
<a name="eq:GumbelDensity"></a>
\[
p(x;\beta) = \frac{1}{\beta} \exp\left( - x/\beta - e^{-x/\beta} \right), \qquad \forall x \in \mathbb{R}.
\tag{2}
\]</p>
<p>Hence, we can write the exponential mechanism in the following way
\[
M^{(1)}(h) = \arg\max \{ h_i + X_i : i \in [d] \}, \qquad \{X_i \} \stackrel{i.i.d.}{\sim} \text{Gumbel}(1/\varepsilon).
\]</p>
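<p>The equivalence between the two forms of \(M^{(1)}\) can be checked numerically. The sketch below, with illustrative function names, computes the exponential mechanism probabilities directly and draws samples by adding Gumbel noise of scale \(1/\varepsilon\) and taking the arg max. It relies on NumPy's <code>Generator.gumbel</code>, whose density with location 0 matches Equation (2).</p>

```python
import numpy as np

def exp_mech_probs(h, eps):
    """Output distribution of the top-1 exponential mechanism on counts h."""
    logits = eps * np.asarray(h, dtype=float)
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

def gumbel_max(h, eps, rng):
    """Report noisy max with i.i.d. Gumbel(1/eps) noise added to each count."""
    h = np.asarray(h, dtype=float)
    noise = rng.gumbel(loc=0.0, scale=1.0 / eps, size=h.shape)
    return int(np.argmax(h + noise))
```

<p>With enough draws, the empirical frequencies of <code>gumbel_max</code> should approach <code>exp_mech_probs</code> up to sampling error.</p>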
<p>We can then extend this to top-\(k\) by repeatedly adding independent Gumbel noise to each count and removing the discovered element for the next round. However, it would significantly improve the run time to add Gumbel noise to each count <em>once</em> and then take the elements with the top-\(k\) noisy counts. We would then add only \(d\) noise terms, rather than the \(O(dk)\) noise terms needed to iteratively run \(k\) different exponential mechanisms. The question is, does this one-shot top-\(k\) Gumbel noise mechanism ensure the same level of privacy?</p>
<p>Let’s denote the one-shot Gumbel mechanism as \(\tilde{M}^{(k)}\). At first glance, it does not seem like the one-shot Gumbel mechanism \(\tilde{M}^{(k)}\) should be just as private as the iterative exponential mechanism \(M^{(k)}\), but it turns out they are exactly the same mechanism! The following result is due to <a href="https://arxiv.org/abs/1905.04273" title="David Durfee, Ryan Rogers. Practical Differentially Private Top-k Selection with Pay-what-you-get Composition. NeurIPS 2019"><strong>[DR19]</strong></a>.</p>
<blockquote>
<p><strong>Theorem 1</strong>
For any input vector of counts \(h \in \mathbb{N}^d\), the one-shot Gumbel mechanism \(\tilde{M}^{(k)}(h)\) and iteratively applying the exponential mechanism \(M^{(k)}(h)\) are equal in distribution.</p>
</blockquote>
<p><em>Proof.</em>
Recall the distribution of the iterative exponential mechanism \(M^{(k)}(h)\) from <a href="#eq:peelingEM">(1)</a>.
Now we consider the one-shot Gumbel mechanism \(\tilde{M}^{(k)}(h)\) where we use the density of \(X \sim \) Gumbel\( (1/\varepsilon)\) from <a href="#eq:GumbelDensity">(2)</a>.<br />
\[
\mathbb{P}[\tilde{M}^{(k)}(h) = (i_1, \cdots, i_k)] \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
\]
\[
\qquad = \int_{-\infty}^\infty p(u_1 - h_{i_1}) \int_{-\infty}^{u_1} p(u_2 - h_{i_2}) \cdots \int_{-\infty}^{u_{k-1}} p(u_k - h_{i_k})
\]
<a name="eq:integral"></a>
\[
\qquad \qquad \cdot \prod_{j \in [d] \setminus \{i_1, \cdots, i_k \} } \mathbb{P}[ X < u_k - h_j]du_k \cdots du_2 du_1.
\tag{3}
\]
Note that we have
\[
\mathbb{P}[X < y] = \exp\left( - \exp\left( -\varepsilon y \right) \right).
\]
Let’s focus on the inner integral over \(u_k\) in <a href="#eq:integral">(3)</a>.<br />
\[
\int_{-\infty}^{u_{k-1}} p(u_k - h_{i_k} )\prod_{j \in [d] \setminus \{i_1, \cdots, i_k \} } \mathbb{P}[X < u_k - h_j ]du_k \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
\]
\[
\quad = \int_{-\infty}^{u_{k-1}} \varepsilon \cdot \exp\left( - \varepsilon (u_k - h_{i_k}) - e^{ -\varepsilon (u_k - h_{i_k}) } \right) \cdot \exp\left( -e^{-\varepsilon u_k} \sum_{j \in [d] \setminus \{i_1, \cdots, i_k \} } e^{\varepsilon h_j} \right) du_k
\]
\[
\quad = \varepsilon e^{\varepsilon h_{i_k}} \int_{-\infty}^{u_{k-1}} \exp\left( -\varepsilon u_k - e^{-\varepsilon u_k} \left( e^{\varepsilon h_{i_k}} + \sum_{j \in [d] \setminus \{i_1, \cdots i_k \} } e^{\varepsilon h_j} \right) \right) du_k \qquad
\]
<a name="eq:lastLine"></a>
\[
\qquad = \varepsilon e^{\varepsilon h_{i_k}} \int_{-\infty}^{u_{k-1}} \exp\left( -\varepsilon u_k - e^{-\varepsilon u_k} \left(\sum_{j \in [d] \setminus \{i_1, \cdots i_{k-1} \} } e^{\varepsilon h_j} \right) \right) du_k. \qquad \qquad
\tag{4}
\]
We now integrate with a \(v\)-substitution,
\[
v =e^{-\varepsilon u_{k}} \sum_{j \in [d] \setminus \{i_1, \cdots i_{k-1} \} } e^{\varepsilon h_j}
\]
\[
dv = - \varepsilon \sum_{j \in [d] \setminus \{i_1, \cdots i_{k-1} \} } e^{\varepsilon h_j} \cdot e^{-\varepsilon u_{k}} du_{k}.
\]</p>
<p>Continuing with <a href="#eq:lastLine">(4)</a>, we get
\[
\int_{-\infty}^{u_{k-1}} p(u_k - h_{i_k} )\prod_{j \in [d] \setminus \{i_1, \cdots, i_k \} } \mathbb{P}[X < u_k - h_j ]du_k \qquad \qquad \qquad \qquad \qquad
\]
\[
\qquad = \frac{e^{\varepsilon h_{i_k} }}{\sum_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \}} e^{\varepsilon h_j}} \cdot \exp\left( - e^{-\varepsilon u_{k-1}} \cdot \sum_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \}} e^{\varepsilon h_j} \right)
\]
\[
\qquad = \frac{e^{\varepsilon h_{i_k} }}{\sum_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \}} e^{\varepsilon h_j}} \cdot \prod_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \} } \mathbb{P}[X < u_{k-1} - h_j ] .
\]
Note how this line has the last term in the expression for \(M^{(k)}(h)\) in <a href="#eq:peelingEM">(1)</a>, which is independent of \(u_{k-1}\) and can hence be pulled out of the larger integral in <a href="#eq:integral">(3)</a>. By induction, we have
\[
\mathbb{P}[\tilde{M}^{(k)}(h) = (i_1, \cdots, i_k)] \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
\]
\[
\qquad = \frac{e^{\varepsilon h_{i_1} }}{\sum_{j \in [d]} e^{\varepsilon h_j}} \cdot \frac{e^{\varepsilon h_{i_2} }}{\sum_{j \in [d] \setminus \{ i_1\}} e^{\varepsilon h_j}} \cdot \cdots \cdot \frac{e^{\varepsilon h_{i_k} }}{\sum_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \}} e^{\varepsilon h_j}}
\]
\[
\qquad = \mathbb{P}[M^{(k)}(h) =(i_1, \cdots, i_k)].
\] ∎</p>
<p>So that’s great! We can now run the one-shot Gumbel mechanism for top-\(k\) and still get the improved composition bounds of the exponential mechanism. In addition to achieving better runtime, this analysis helps in proving privacy for top-\(k\) algorithms over a large domain universe, even when the algorithm has access to only the true top-\(\bar{k}\) items and their counts, where \(\bar{k} > k \); see <a href="https://arxiv.org/abs/1905.04273" title="David Durfee, Ryan Rogers. Practical Differentially Private Top-k Selection with Pay-what-you-get Composition. NeurIPS 2019"><strong>[DR19]</strong></a> for more details.</p>
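<p>Theorem 1 can also be sanity-checked empirically. The following sketch, with illustrative function names and 0-based indices, implements the one-shot Gumbel mechanism and evaluates the closed-form probability from Equation (1), so that the frequency of any ranked outcome can be compared against it. A Monte Carlo comparison of this kind is only a check up to sampling error, not a substitute for the proof.</p>

```python
import numpy as np

def one_shot_gumbel_top_k(h, eps, k, rng):
    """Add Gumbel(1/eps) noise to every count once; return the k largest noisy counts, ranked."""
    h = np.asarray(h, dtype=float)
    noisy = h + rng.gumbel(loc=0.0, scale=1.0 / eps, size=h.shape)
    return tuple(int(i) for i in np.argsort(-noisy)[:k])

def peeling_probability(h, eps, outcome):
    """Probability of a ranked outcome (distinct indices) under Equation (1)."""
    h = np.asarray(h, dtype=float)
    weights = np.exp(eps * (h - h.max()))   # common shift cancels in every ratio
    remaining = weights.sum()
    prob = 1.0
    for i in outcome:
        prob *= weights[i] / remaining
        remaining -= weights[i]             # remove the selected element from the denominator
    return prob

def empirical_frequency(h, eps, outcome, trials, rng):
    """Fraction of one-shot runs that return exactly `outcome`."""
    target = tuple(outcome)
    hits = sum(one_shot_gumbel_top_k(h, eps, len(target), rng) == target
               for _ in range(trials))
    return hits / trials
```

<p>For example, with <code>h = [3, 2, 2, 0]</code> and <code>eps = 1.0</code>, the value of <code>empirical_frequency(h, 1.0, (0, 1), 20000, rng)</code> should be close to <code>peeling_probability(h, 1.0, (0, 1))</code>.</p>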
<h2 id="report-noisy-max-for-dp-top-k">Report Noisy Max for DP Top-\(k\)</h2>
<p>We now turn to comparing this algorithm to some natural alternatives. As we discussed in the last post, there is a family of mechanisms, report noisy max (RNM) mechanisms, that ensure differential privacy for the selection problem, and hence the top-\(1\) problem. We showed that the exponential mechanism is equivalent to RNM with Gumbel noise. There is also RNM with Laplace noise and with Exponential noise, the latter being the recently discovered <em>permute-and-flip</em> mechanism <a href="https://arxiv.org/abs/2010.12603" title="Ryan McKenna, Daniel Sheldon. Permute-and-Flip: A new mechanism for differentially private selection. NeurIPS 2020."><strong>[MS20]</strong></a> <a href="https://arxiv.org/abs/2105.07260" title="Zeyu Ding, Daniel Kifer, Sayed M. Saghaian N. E., Thomas Steinke, Yuxin Wang, Yingtai Xiao, Danfeng Zhang. The Permute-and-Flip Mechanism is Identical to Report-Noisy-Max with Exponential Noise. 2021."><strong>[DKSSWXZ21]</strong></a>.</p>
<p>To then use RNM mechanisms for top-\(k\), we can again iteratively apply them and use composition to get the overall privacy guarantee. However, it turns out that you can also use the Laplace noise version of RNM in one-shot <a href="https://arxiv.org/abs/2105.08233" title="Gang Qiao, Weijie J. Su, Li Zhang. Oneshot Differentially Private Top-k Selection. ICML 2021."><strong>[QSZ21]</strong></a>.</p>
<p>We can compare the relative noise that is added to each count in both the Laplace and Gumbel versions. Since <a href="https://arxiv.org/abs/2105.08233" title="Gang Qiao, Weijie J. Su, Li Zhang. Oneshot Differentially Private Top-k Selection. ICML 2021."><strong>[QSZ21]</strong></a> gives their privacy guarantee in terms of approximate \((\varepsilon,\delta )\)-DP, we will now make the comparison there. We first look at the standard deviation for Laplace \( \sigma_{\text{Lap}}\) (using Theorem 2.2 in <a href="https://arxiv.org/abs/2105.08233" title="Gang Qiao, Weijie J. Su, Li Zhang. Oneshot Differentially Private Top-k Selection. ICML 2021."><strong>[QSZ21]</strong></a>).
\[
\sigma_{\text{Lap}} = \frac{8 \sqrt{2k \ln(d/\delta)}}{\varepsilon}.
\]
Note that the one-shot Laplace mechanism returns counts as well as the indices of the top-\(k\), both of which use Laplace noise with standard deviation \(\sigma_{\text{Lap}}\), so we will also add Laplace noise to the counts of the discovered elements in the Gumbel version. That is, we add Gumbel noise with scale \(\sqrt{k}/\varepsilon'\) for the discovery portion and Laplace noise with scale \(2\sqrt{k}/\varepsilon'\) for obtaining their counts, resulting in noise with standard deviations \(\sigma_{\text{Gumb}}'\) and \(\sigma_{\text{Lap}}'\), respectively.
\[
\sigma_{\text{Gumb}}' = \frac{\pi \sqrt{k} }{\sqrt{6}\varepsilon'}, \qquad
\sigma_{\text{Lap}}' = \frac{2\sqrt{2k}}{\varepsilon'}.
\]
Recall that adding this scale of Gumbel noise and Laplace noise will ensure \(\tfrac{\varepsilon'^2}{8} \)-CDP each, so combining will ensure \(\tfrac{\varepsilon'^2}{4}\)-CDP. We could also use Gaussian noise to return the counts since we are using CDP, but we will analyze it with Laplace noise for comparison. To ensure \((\varepsilon,\delta)\)-DP, we use the CDP to DP conversion from Lemma 3.5 in <a href="https://arxiv.org/abs/1605.02065" title="Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016."><strong>[BS16]</strong></a> and solve for \(\varepsilon'\). Hence, we get for any \(\delta>0\)
\[
\varepsilon'^2/4 = \left( \sqrt{\ln(1/\delta) + \varepsilon} - \sqrt{\ln(1/\delta)} \right)^2
\]
\[ \implies \varepsilon' = 2 \sqrt{\ln(1/\delta)} \left( \sqrt{1 + \tfrac{\varepsilon}{\ln(1/\delta)}} - 1 \right).
\]</p>
<p>Let’s consider a typical privacy setting where \(\varepsilon < \ln(1/\delta)\), and use the inequality \(\sqrt{1+x} \geq 1 + x/4\) for \(0<x<1\). Here is a short proof of this inequality:
\[
(1 + x/4)^2 = 1 + x/2 + x^2/16 \leq 1 + x/2 + x/2 = 1 + x.
\]
Note that the privacy guarantee for one-shot Laplace noise only holds when \(\varepsilon < 0.2\) and \(\delta < 0.05\) as stated in Theorem 2.2 in <a href="https://arxiv.org/abs/2105.08233" title="Gang Qiao, Weijie J. Su, Li Zhang. Oneshot Differentially Private Top-k Selection. ICML 2021."><strong>[QSZ21]</strong></a>. In this case, we have
\[
\varepsilon' \geq 2 \sqrt{\ln(1/\delta)} \left( 1 + 1/4 \cdot \tfrac{\varepsilon}{\ln(1/\delta)} - 1 \right) = 1/2 \cdot \tfrac{\varepsilon}{\sqrt{\ln(1/\delta)}}.
\]
Plugging \(\varepsilon'\) into the standard deviation of Gumbel and Laplace, we get
\[
\sigma_{\text{Gumb}}' \leq \frac{2\pi\sqrt{k \ln(1/\delta)}}{\sqrt{6}\cdot \varepsilon}, \qquad \sigma_{\text{Lap}}' \leq \frac{4\sqrt{2k\ln(1/\delta)}}{\varepsilon}.
\]</p>
<p>Putting this together, we can show that we add significantly less noise for the discovery and releasing noisy count phases,
\[
\sigma_{\text{Gumb}}'\leq \sigma_{\text{Lap}}/4 ,\qquad \sigma_{\text{Lap}}' \leq \sigma_{\text{Lap}}/2.
\]
Note that these bounds can be improved further with similar analysis.</p>
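<p>The comparison can be reproduced numerically for concrete parameters. The sketch below evaluates the formulas above as written; the function names and the dictionary keys are illustrative assumptions. It uses the exact expression for \(\varepsilon'\), not the lower bound, so the resulting standard deviations are at most the bounds derived in the text.</p>

```python
import math

def eps_prime(eps, delta):
    """Solve eps'^2 / 4 = (sqrt(ln(1/delta) + eps) - sqrt(ln(1/delta)))^2 for eps'."""
    log_term = math.log(1.0 / delta)
    return 2.0 * math.sqrt(log_term) * (math.sqrt(1.0 + eps / log_term) - 1.0)

def noise_std(eps, delta, k, d):
    """Standard deviations of the noise added by each approach."""
    ep = eps_prime(eps, delta)
    return {
        # One-shot Laplace, as quoted in the text from Theorem 2.2 of [QSZ21].
        "one_shot_laplace": 8.0 * math.sqrt(2.0 * k * math.log(d / delta)) / eps,
        # Gumbel noise with scale sqrt(k) / eps' for discovery.
        "gumbel_discovery": math.pi * math.sqrt(k) / (math.sqrt(6.0) * ep),
        # Laplace noise with scale 2 * sqrt(k) / eps' for the released counts.
        "laplace_counts": 2.0 * math.sqrt(2.0 * k) / ep,
    }
```

<p>For instance, <code>noise_std(eps=0.1, delta=1e-6, k=10, d=1000)</code> lies in the regime \(\varepsilon < 0.2\), \(\delta < 0.05\) where the one-shot Laplace guarantee is stated to hold, and the returned values can be compared against the factors of \(1/4\) and \(1/2\) above.</p>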
<p>Although it has not been studied yet whether the permute-and-flip mechanism \(M_{\text{PF}} \) can also ensure DP in one shot by using Exponential noise, we briefly discuss whether it can satisfy bounded range with a parameter similar to that of the exponential mechanism, and hence achieve similar composition bounds. Consider running permute-and-flip on two items \(\{1,2 \}\) with a monotonic quality score \(q: \mathcal{X} \times \{1,2 \} \to \mathbb{R} \) whose sensitivity is 1. Let \(x, x' \in \mathcal{X} \) be neighbors where
\[
q(x,1) = q(x,2) = 0
\]
\[
q(x',1) = 0 , \quad q(x',2) = 1.
\]
Hence, permute-and-flip will return outcome \(1\) or \(2\) with probability \(1/2\) each on dataset \(x\), while on dataset \(x'\) outcome \(1\) occurs with probability \(1/2 \cdot e^{-\varepsilon}\) and outcome \(2\) occurs with probability \( 1/2 + 1/2 \cdot (1 - e^{-\varepsilon})\). We can then compute the bounded range parameter \(\alpha\) as
\[
\frac{\mathbb{P}[M_{\text{PF}}(x') = 2 ] }{\mathbb{P}[M_{\text{PF}}(x) = 2 ]}\leq e^{\alpha}\frac{\mathbb{P}[M_{\text{PF}}(x') = 1 ]}{\mathbb{P}[M_{\text{PF}}(x) = 1 ]} \implies \alpha \geq \varepsilon + \ln(2 - e^{-\varepsilon} ).
\]
Note that with \(\varepsilon \gg 1\), we get \(\alpha \approx \varepsilon + \ln 2\), which is close to \(\varepsilon\) in relative terms and would match the bounded range parameter of the exponential mechanism. However, with small \(\varepsilon\) (\(\varepsilon \ll 1\)), we get \(\alpha\) close to \(2\varepsilon\). This example provides a lower bound on the BR parameter for permute-and-flip.</p>
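<p>The two-item example can be worked out in a few lines. The sketch below computes the exact permute-and-flip output distribution for two items, assuming (as in the example above) that an item whose score is \(g\) below the best score is accepted with probability \(e^{-\varepsilon g}\), and then measures the spread of the log probability ratios between the two datasets. The function names are illustrative.</p>

```python
import math

def permute_and_flip_two_items(q1, q2, eps):
    """Exact output distribution of permute-and-flip on two items."""
    best = max(q1, q2)
    a1 = math.exp(eps * (q1 - best))   # acceptance probability of item 1
    a2 = math.exp(eps * (q2 - best))   # acceptance probability of item 2
    # Each order is equally likely; the second item is reached only if the first is rejected.
    p1 = 0.5 * a1 + 0.5 * (1.0 - a2) * a1
    p2 = 0.5 * a2 + 0.5 * (1.0 - a1) * a2
    return (p1, p2)

def bounded_range_gap(p, p_prime):
    """Spread of log(p_prime[y] / p[y]) over outcomes y, for one pair of datasets."""
    log_ratios = [math.log(b / a) for a, b in zip(p, p_prime)]
    return max(log_ratios) - min(log_ratios)
```

<p>Calling <code>bounded_range_gap(permute_and_flip_two_items(0, 0, eps), permute_and_flip_two_items(0, 1, eps))</code> reproduces \(\varepsilon + \ln(2 - e^{-\varepsilon})\), and dividing by <code>eps</code> for a small value such as 0.01 gives a ratio close to 2.</p>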
<h2 id="conclusion">Conclusion</h2>
<p>We have looked at the top-\(k\) selection problem subject to differential privacy and although there are many different mechanisms to use, the exponential mechanism stands out for several reasons:</p>
<ol>
<li>The exponential mechanism is \(\varepsilon\)-DP and \( \varepsilon^2/8\)-CDP and hence gets improved composition.</li>
<li>Iteratively applying the exponential mechanism for top-\(k\) can be implemented by adding Gumbel noise to each count and returning the elements with the top-\(k\) noisy counts in one-shot.</li>
<li>The one-shot Gumbel mechanism returns a ranked list of \(k\) elements, rather than a set of \(k\) elements.</li>
</ol>
David Durfee, Ryan Rogers. Mon, 09 Aug 2021 10:00:00 -0700
https://differentialprivacy.org/one-shot-top-k/
A Better Privacy Analysis of the Exponential Mechanism
<p>A basic and frequent task in data analysis is <em>selection</em> – given a set of options \(\mathcal{Y}\), output the (approximately) best one, where “best” is defined by some loss function \(\ell : \mathcal{Y} \times \mathcal{X}^n \to \mathbb{R}\) and a dataset \(x \in \mathcal{X}^n\). That is, we want to output some \(y \in \mathcal{Y}\) that approximately minimizes \(\ell(y,x)\). Naturally, we are interested in <em>private selection</em> – i.e., the output should be differentially private in terms of the dataset \(x\).
This post discusses algorithms for private selection – in particular, we give an improved privacy analysis of the popular exponential mechanism.</p>
<h2 id="the-exponential-mechanism">The Exponential Mechanism</h2>
<p>The most well-known algorithm for private selection is the <a href="https://en.wikipedia.org/wiki/Exponential_mechanism_(differential_privacy)"><em>exponential mechanism</em></a> <a href="https://doi.org/10.1109/FOCS.2007.66" title="Frank McSherry, Kunal Talwar. Mechanism Design via Differential Privacy. FOCS 2007."><strong>[MT07]</strong></a>. The exponential mechanism \(M : \mathcal{X}^n \to \mathcal{Y} \) is a randomized algorithm given by \[\forall x \in \mathcal{X}^n ~ \forall y \in \mathcal{Y} ~~~~~ \mathbb{P}[M(x) = y] = \frac{\exp(-\frac{\varepsilon}{2\Delta} \ell(y,x))}{\sum_{y' \in \mathcal{Y}} \exp(-\frac{\varepsilon}{2\Delta} \ell(y',x)) }, \tag{1}\] where \(\Delta\) is the sensitivity of the loss function \(\ell\) given by \[\Delta = \sup_{x,x' \in \mathcal{X}^n : d(x,x') \le 1} \max_{y\in\mathcal{Y}} |\ell(y,x) - \ell(y,x')|,\tag{2}\] where the supremum is taken over all datasets \(x\) and \(x'\) differing on the data of a single individual (which we denote by \(d(x,x')\le 1\)).</p>
<p>In terms of utility, we can easily show that <a href="https://arxiv.org/abs/1511.02513" title="Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, Jonathan Ullman. Algorithmic Stability for Adaptive Data Analysis. STOC 2016."><strong>[BNSSSU16]</strong></a> \[\mathbb{E}[\ell(M(x),x)] \le \min_{y \in \mathcal{Y}} \ell(y,x) + \frac{2\Delta}{\varepsilon} \log |\mathcal{Y}|\] for all \(x \in \mathcal{X}^n\) (and we can also give high probability bounds).</p>
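<p>As a concrete illustration of Equation (1) and the utility bound, the sketch below computes the output distribution of the exponential mechanism for a finite list of losses and the resulting expected loss. The function names are assumptions made for this example, and the sensitivity \(\Delta\) is passed in by the caller, not derived from the data.</p>

```python
import numpy as np

def exponential_mechanism_probs(losses, eps, sensitivity):
    """P[M(x) = y] proportional to exp(-eps / (2 * sensitivity) * loss(y, x))."""
    scores = -eps / (2.0 * sensitivity) * np.asarray(losses, dtype=float)
    weights = np.exp(scores - scores.max())   # shift by the max for numerical stability
    return weights / weights.sum()

def expected_loss(losses, eps, sensitivity):
    """Expected loss of the selected option under the exponential mechanism."""
    probs = exponential_mechanism_probs(losses, eps, sensitivity)
    return float(probs @ np.asarray(losses, dtype=float))

def sample(losses, eps, sensitivity, rng):
    """Draw one option index from the exponential mechanism."""
    probs = exponential_mechanism_probs(losses, eps, sensitivity)
    return int(rng.choice(len(probs), p=probs))
```

<p>For any finite list of losses, <code>expected_loss</code> should not exceed the minimum loss plus \(\frac{2\Delta}{\varepsilon} \log |\mathcal{Y}|\), in line with the bound above.</p>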
<p>It is easy to show that the exponential mechanism satisfies \(\varepsilon\)-differential privacy.
But there is more to this story! We’re going to look at a more refined privacy analysis.</p>
<h2 id="bounded-range">Bounded Range</h2>
<p>The privacy guarantee of the exponential mechanism is more precisely characterized by <em>bounded range</em>. This was observed and defined by David Durfee and Ryan Rogers <a href="https://arxiv.org/abs/1905.04273" title="David Durfee, Ryan Rogers. Practical Differentially Private Top-k Selection with Pay-what-you-get Composition. NeurIPS 2019"><strong>[DR19]</strong></a> and further analyzed later <a href="https://arxiv.org/abs/1909.13830" title="Jinshuo Dong, David Durfee, Ryan Rogers. Optimal Differential Privacy Composition for Exponential Mechanisms. ICML 2020."><strong>[DDR20]</strong></a>.</p>
<blockquote>
<p><strong>Definition 1 (Bounded Range).</strong><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
A randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\eta\)-bounded range if, for all pairs of inputs \(x, x' \in \mathcal{X}^n\) differing only on the data of a single individual, there exists some \(t \in \mathbb{R}\) such that \[\forall y \in \mathcal{Y} ~~~~~ \log\left(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x')=y]}\right) \in [t, t+\eta].\] Here \(t\) may depend on the pair of input datasets \(x,x'\), but not on the output \(y\).</p>
</blockquote>
<p>To interpret this definition, we <a href="/flavoursofdelta/">recall the definition of the privacy loss random variable</a>: Define \(f : \mathcal{Y} \to \mathbb{R}\) by \[f(y) = \log\left(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x')=y]}\right).\] Then the privacy loss random variable \(Z \gets \mathsf{PrivLoss}(M(x)\|M(x'))\) is given by \(Z = f(M(x))\).</p>
<p>Pure \(\varepsilon\)-differential privacy is equivalent to demanding that the privacy loss is bounded by \(\varepsilon\) – i.e., \(\mathbb{P}[|Z|\le\varepsilon]=1\). Approximate \((\varepsilon,\delta)\)-differential privacy is, roughly, equivalent to demanding that \(\mathbb{P}[Z\le\varepsilon]\ge1-\delta\).<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
<p>Now \(\eta\)-bounded range is simply demanding that the privacy loss \(Z\) is supported on some interval of length \(\eta\). This interval \([t,t+\eta]\) may depend on the pair \(x,x’\).</p>
<p>Bounded range and pure differential privacy are equivalent up to a factor of 2 in the parameters:</p>
<blockquote>
<p><strong>Lemma 2 (Bounded Range versus Pure Differential Privacy).</strong></p>
<ul>
<li>\(\varepsilon\)-differential privacy implies \(\eta\)-bounded range with \(\eta \le 2\varepsilon\).</li>
<li>\(\eta\)-bounded range implies \(\varepsilon\)-differential privacy with \(\varepsilon \le \eta\).</li>
</ul>
</blockquote>
<p><em>Proof.</em> The first part of the equivalence follows from the fact that pure \(\varepsilon\)-differential privacy implies the privacy loss is supported on the interval \([-\varepsilon,\varepsilon]\). Thus, if we set \(t=-\varepsilon\) and \(\eta=2\varepsilon\), then \([t,t+\eta] = [-\varepsilon,\varepsilon]\).
The second part follows from the fact that the support of the privacy loss \([t,t+\eta]\) must straddle \(0\). That is, the privacy loss cannot be always positive nor always negative, so \(0 \in [t,t+\eta]\) and, hence, \([t,t+\eta] \subseteq [-\eta,\eta]\). Otherwise \(\forall y ~ f(y)>0\) or \(\forall y ~ f(y)<0\) would imply \(\forall y ~ \mathbb{P}[M(x)=y]>\mathbb{P}[M(x’)=y]\) or \(\forall y ~ \mathbb{P}[M(x)=y]<\mathbb{P}[M(x’)=y]\), contradicting the fact that \(\sum_{y \in \mathcal{Y}} \mathbb{P}[M(x)=y] = 1\) and \(\sum_{y \in \mathcal{Y}} \mathbb{P}[M(x’)=y] = 1\). ∎</p>
<p>OK, back to the exponential mechanism:</p>
<blockquote>
<p><strong>Lemma 3 (The Exponential Mechanism is Bounded Range).</strong>
The exponential mechanism (given in Equation 1 above) satisfies \(\varepsilon\)-bounded range.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
</blockquote>
<p><em>Proof.</em>
We have \[e^{f(y)} = \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} = \frac{\exp(-\frac{\varepsilon}{2\Delta}\ell(y,x))}{\exp(-\frac{\varepsilon}{2\Delta}\ell(y,x’))} \cdot \frac{\sum_{y’} \exp(-\frac{\varepsilon}{2\Delta} \ell(y’,x’))}{\sum_{y’} \exp(-\frac{\varepsilon}{2\Delta} \ell(y’,x))}.\]
Setting \(t = \log\left(\frac{\sum_{y’} \exp(-\frac{\varepsilon}{2\Delta} \ell(y’,x’))}{\sum_{y’} \exp(-\frac{\varepsilon}{2\Delta} \ell(y’,x))}\right) - \frac{\varepsilon}{2}\), we have \[ f(y) = \frac{\varepsilon}{2\Delta} (\ell(y,x’)-\ell(y,x)+\Delta) + t.\]
By the definition of sensitivity (given in Equation 2), we have \( 0 \le \ell(y,x’)-\ell(y,x)+\Delta \le 2\Delta\), whence \(t \le f(y) \le t + \varepsilon\). ∎</p>
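<p>As a sanity check on Lemma 3, the following minimal Python sketch builds the exponential mechanism from Equation 1 for a small loss table and verifies that the privacy loss stays within an interval of length at most \(\varepsilon\). The loss values, the choice \(\Delta = 1\), and the helper names <code>exp_mech_probs</code> and <code>privacy_loss</code> are illustrative assumptions, not part of the analysis above.</p>

```python
import math

def exp_mech_probs(losses, eps, sensitivity):
    """Exponential mechanism: P[y] is proportional to exp(-eps/(2*sensitivity) * loss[y])."""
    # Subtracting the minimum loss before exponentiating improves numerical
    # stability and does not change the normalized probabilities.
    lo = min(losses)
    weights = [math.exp(-eps / (2 * sensitivity) * (l - lo)) for l in losses]
    total = sum(weights)
    return [w / total for w in weights]

def privacy_loss(p, q):
    """f(y) = log(P[M(x)=y] / P[M(x')=y]) for every output y."""
    return [math.log(a / b) for a, b in zip(p, q)]

# Hypothetical losses on two neighbouring datasets; each entry changes by at most 1.
eps, sensitivity = 1.0, 1.0
loss_x = [3.0, 1.0, 4.0, 2.0]
loss_x_prime = [2.0, 1.5, 4.0, 1.5]

f = privacy_loss(exp_mech_probs(loss_x, eps, sensitivity),
                 exp_mech_probs(loss_x_prime, eps, sensitivity))
observed_range = max(f) - min(f)

print("privacy losses:", [round(v, 4) for v in f])
print("observed range:", round(observed_range, 4), "with eps =", eps)
```

<p>In this toy instance the observed range equals \(\frac{\varepsilon}{2\Delta}\) times the spread of the per-option loss differences, which is how the proof bounds it in general.</p>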
<p>Bounded range is not really a useful privacy definition on its own. Thus we’re going to relate it to a relaxed version of differential privacy next.</p>
<h2 id="concentrated-differential-privacy">Concentrated Differential Privacy</h2>
<p>Concentrated differential privacy <a href="https://arxiv.org/abs/1605.02065" title="Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016."><strong>[BS16]</strong></a> and its variants <a href="https://arxiv.org/abs/1603.01887" title="Cynthia Dwork, Guy N. Rothblum. Concentrated Differential Privacy. 2016."><strong>[DR16]</strong></a> <a href="https://arxiv.org/abs/1702.07476" title="Ilya Mironov. Rényi Differential Privacy. CCS 2017."><strong>[M17]</strong></a> are relaxations of pure differential privacy with many nice properties. In particular, they compose very cleanly.</p>
<blockquote>
<p><strong>Definition 4 (Concentrated Differential Privacy).</strong>
A randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\rho\)-concentrated differential privacy if, for all pairs of inputs \(x, x’ \in \mathcal{X}^n\) differing only on the data of a single individual,
\[\forall \lambda > 0 ~~~~~ \mathbb{E}[\exp( \lambda Z)] \le \exp(\lambda(\lambda+1)\rho),\tag{3}\]
where \(Z \gets \mathsf{PrivLoss}(M(x)\|M(x’))\) is the privacy loss random variable.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>
</blockquote>
<p>Intuitively, concentrated differential privacy requires that the privacy loss is subgaussian. Specifically, the bound on the moment generating function of \(\rho\)-concentrated differential privacy is tight if the privacy loss \(Z\) follows the distribution \(\mathcal{N}(\rho,2\rho)\). Indeed, the privacy loss random variable of the Gaussian mechanism has such a distribution.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></p>
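<p>To see why this distribution makes the bound tight, recall that a Gaussian \(Z \sim \mathcal{N}(\mu,\sigma^2)\) has moment generating function \(\mathbb{E}[\exp(\lambda Z)] = \exp\left(\mu\lambda + \frac12\sigma^2\lambda^2\right)\). Substituting \(\mu=\rho\) and \(\sigma^2=2\rho\) gives \[\mathbb{E}[\exp(\lambda Z)] = \exp\left(\rho\lambda + \rho\lambda^2\right) = \exp(\lambda(\lambda+1)\rho),\] which matches the right-hand side of Equation 3 exactly.</p>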
<p>OK, back to the exponential mechanism:
We know that \(\varepsilon\)-differential privacy implies \(\frac12 \varepsilon^2\)-concentrated differential privacy <a href="https://arxiv.org/abs/1605.02065" title="Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016."><strong>[BS16]</strong></a>.
This, of course, applies to the exponential mechanism. A cool fact – that we want to draw more attention to – is that we can do better!
Specifically, \(\eta\)-bounded range implies \(\frac18 \eta^2\)-concentrated differential privacy <a href="https://arxiv.org/abs/2004.07223" title="Mark Cesar, Ryan Rogers. Bounding, Concentrating, and Truncating: Unifying Privacy Loss Composition for Data Analytics. ALT 2021."><strong>[CR21]</strong></a>.
What follows is a proof of this fact following that of Mark Cesar and Ryan Rogers, but with some simplification.</p>
<blockquote>
<p><strong>Theorem 5 (Bounded Range implies Concentrated Differential Privacy).</strong>
If \(M\) is \(\eta\)-bounded range, then it is \(\frac18\eta^2\)-concentrated differentially private.</p>
</blockquote>
<p><em>Proof.</em>
Fix datasets \(x,x’ \in \mathcal{X}^n\) differing on a single individual’s data.
Let \(Z \gets \mathsf{PrivLoss}(M(x)\|M(x’))\) be the privacy loss random variable of the mechanism \(M\) on this pair of datasets.
By the definition of bounded range (Definition 1), there exists some \(t \in \mathbb{R}\) such that \(Z \in [t, t+\eta]\) with probability 1.
Now we employ <a href="https://en.wikipedia.org/wiki/Hoeffding%27s_lemma">Hoeffding’s Lemma</a> <a href="https://doi.org/10.1080%2F01621459.1963.10500830" title="Wassily Hoeffding. Probability inequalities for sums of bounded random variables. JASA 1963."><strong>[H63]</strong></a>:</p>
<blockquote>
<p><strong>Lemma 6 (Hoeffding’s Lemma).</strong>
Let \(X\) be a random variable supported on the interval \([a,b]\). Then, for all \(\lambda \in \mathbb{R}\), we have \[\mathbb{E}[\exp(\lambda X)] \le \exp \left( \mathbb{E}[X] \cdot \lambda + \frac{(b-a)^2}{8} \cdot \lambda^2 \right).\]</p>
</blockquote>
<p>Applying the lemma to the privacy loss gives \[\forall \lambda \in \mathbb{R} ~~~~~ \mathbb{E}[\exp(\lambda Z)] \le \exp \left( \mathbb{E}[Z] \cdot \lambda + \frac{\eta^2}{8} \cdot \lambda^2 \right).\]
The only remaining thing we need to show is that \(\mathbb{E}[Z] \le \frac18 \eta^2\).<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></p>
<p>If we set \(\lambda = -1 \), then we get \( \mathbb{E}[\exp( - Z)] \le \exp \left( -\mathbb{E}[Z] + \frac{\eta^2}{8} \right)\), which rearranges to \(\mathbb{E}[Z] \le \frac18 \eta^2 - \log \mathbb{E}[\exp( - Z)]\).
Now we have \[ \mathbb{E}[\exp( - Z)] \!=\! \sum_y \mathbb{P}[M(x)\!=\!y] \exp(-f(y)) \!=\! \sum_y \mathbb{P}[M(x)\!=\!y] \!\cdot\! \frac{\mathbb{P}[M(x’)\!=\!y]}{\mathbb{P}[M(x)\!=\!y]} \!=\! 1.\]
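Hence \(\mathbb{E}[Z] \le \frac18\eta^2\). Substituting this into the bound from Hoeffding’s Lemma gives, for all \(\lambda > 0\), \(\mathbb{E}[\exp(\lambda Z)] \le \exp\left(\frac18\eta^2 \cdot \lambda + \frac18\eta^2 \cdot \lambda^2\right) = \exp\left(\lambda(\lambda+1) \cdot \frac18\eta^2\right)\), which is exactly the condition in Definition 4 with \(\rho = \frac18\eta^2\). ∎</p>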
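<p>Theorem 5 can also be checked numerically on a toy instance. The sketch below computes the moment generating function of the privacy loss of the exponential mechanism exactly and compares it with \(\exp(\lambda(\lambda+1)\varepsilon^2/8)\) for a few values of \(\lambda\). The four-option loss table, the assumed sensitivity \(\Delta = 1\), and the helper names are made up for illustration; this is a spot check under those assumptions, not a proof.</p>

```python
import math

def exp_mech_probs(losses, eps, sensitivity=1.0):
    """P[M(x)=y] is proportional to exp(-eps/(2*sensitivity) * loss[y])."""
    weights = [math.exp(-eps / (2 * sensitivity) * l) for l in losses]
    total = sum(weights)
    return [w / total for w in weights]

def privacy_loss_mgf(p, q, lam):
    """E[exp(lam * Z)] where Z = log(p[Y] / q[Y]) and Y is drawn from p."""
    return sum(a * (a / b) ** lam for a, b in zip(p, q))

eps = 1.0
# Illustrative losses on neighbouring datasets (entries differ by at most 1).
p = exp_mech_probs([3.0, 1.0, 4.0, 2.0], eps)
q = exp_mech_probs([2.0, 1.5, 4.0, 1.5], eps)

rho = eps ** 2 / 8  # concentrated DP parameter claimed by Theorem 5
checks = []
for lam in [0.5, 1.0, 2.0, 5.0, 10.0]:
    mgf = privacy_loss_mgf(p, q, lam)
    bound = math.exp(lam * (lam + 1) * rho)
    checks.append((lam, mgf, bound))
    print(f"lambda={lam:5.1f}  mgf={mgf:.6f}  bound={bound:.6f}")
```

<p>Calling <code>privacy_loss_mgf(p, q, -1)</code> returns 1 up to rounding, which is the identity \(\mathbb{E}[\exp(-Z)] = 1\) used in the proof.</p>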
<p>This brings us to the TL;DR of this post:</p>
<blockquote>
<p><strong>Corollary 7.</strong> The exponential mechanism (given by Equation 1) is \(\frac18 \varepsilon^2\)-concentrated differentially private.</p>
</blockquote>
<p>This is great news. The standard analysis only gives \(\frac12 \varepsilon^2\)-concentrated differential privacy. Constants matter when applying differential privacy, and we save a factor of 4 in the concentrated differential privacy analysis of the exponential mechanism for free with this improved analysis.</p>
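<p>To make the factor of 4 concrete, recall that concentrated differential privacy composes additively <a href="https://arxiv.org/abs/1605.02065" title="Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016."><strong>[BS16]</strong></a>: running a \(\rho_1\)-concentrated differentially private algorithm followed by a \(\rho_2\)-concentrated differentially private one is \((\rho_1+\rho_2)\)-concentrated differentially private. For \(k\) runs of the exponential mechanism, each with parameter \(\varepsilon\), the two analyses therefore give \[\rho_{\text{improved}} = k \cdot \frac{\varepsilon^2}{8} \qquad \text{versus} \qquad \rho_{\text{standard}} = k \cdot \frac{\varepsilon^2}{2} = 4\rho_{\text{improved}}.\] Equivalently, for a fixed total budget \(\rho\), the improved analysis permits four times as many runs at the same \(\varepsilon\), or the same number of runs with \(\varepsilon\) twice as large.</p>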
<p>Combining Lemma 2 with Theorem 5 also gives a simpler proof of the conversion from pure differential privacy to concentrated differential privacy <a href="https://arxiv.org/abs/1605.02065" title="Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016."><strong>[BS16]</strong></a>:</p>
<blockquote>
<p><strong>Corollary 8.</strong> \(\varepsilon\)-differential privacy implies \(\frac12 \varepsilon^2\)-concentrated differential privacy.</p>
</blockquote>
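<p>Spelling out the arithmetic: by Lemma 2, \(\varepsilon\)-differential privacy implies \(\eta\)-bounded range with \(\eta \le 2\varepsilon\), and Theorem 5 then gives \(\rho\)-concentrated differential privacy with \[\rho = \frac18 \eta^2 \le \frac18 (2\varepsilon)^2 = \frac12 \varepsilon^2.\]</p>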
<h2 id="beyond-the-exponential-mechanism">Beyond the Exponential Mechanism</h2>
<p>The exponential mechanism is not the only algorithm for private selection. A closely-related algorithm is <em>report noisy max/min</em>:<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup> Draw independent noise \(\xi_y\) from some distribution for each \(y \in \mathcal{Y}\) then output \[M(x) = \underset{y \in \mathcal{Y}}{\mathrm{argmin}} ~ \ell(y,x) - \xi_y.\]</p>
<p>If the noise distribution is an appropriate <a href="https://en.wikipedia.org/wiki/Gumbel_distribution">Gumbel distribution</a>, then report noisy max is exactly the exponential mechanism. (This equivalence is known as the “Gumbel max trick.”)</p>
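<p>The equivalence can be illustrated with a short simulation. The sketch below assumes that the "appropriate" Gumbel distribution has location 0 and scale \(2\Delta/\varepsilon\), which is the scale under which \(\mathrm{argmin}_y ~ \ell(y,x) - \xi_y\) selects \(y\) with probability proportional to \(\exp(-\frac{\varepsilon}{2\Delta}\ell(y,x))\). The loss values, the seed, and the helper names are hypothetical.</p>

```python
import math
import random

def exp_mech_probs(losses, eps, sensitivity=1.0):
    """Exact exponential mechanism probabilities for comparison."""
    weights = [math.exp(-eps / (2 * sensitivity) * l) for l in losses]
    total = sum(weights)
    return [w / total for w in weights]

def sample_gumbel(rng, scale):
    """Draw from a Gumbel distribution with location 0 and the given scale."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return -scale * math.log(-math.log(u))

def report_noisy_min(losses, eps, rng, sensitivity=1.0):
    """Return argmin_y loss[y] - xi_y with xi_y ~ Gumbel(scale = 2*sensitivity/eps)."""
    scale = 2 * sensitivity / eps
    noisy = [l - sample_gumbel(rng, scale) for l in losses]
    return min(range(len(losses)), key=noisy.__getitem__)

rng = random.Random(0)
losses, eps, trials = [3.0, 1.0, 4.0, 2.0], 1.0, 100_000
counts = [0] * len(losses)
for _ in range(trials):
    counts[report_noisy_min(losses, eps, rng)] += 1

empirical = [c / trials for c in counts]
exact = exp_mech_probs(losses, eps)
for y, (e, p) in enumerate(zip(empirical, exact)):
    print(f"y={y}  empirical={e:.4f}  exponential mechanism={p:.4f}")
```

<p>The empirical frequencies should agree with the exact probabilities up to sampling error, which shrinks as <code>trials</code> grows.</p>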
<p>We can also use the Laplace distribution or the exponential distribution. Report noisy max with the exponential distribution is equivalent to the <em>permute and flip</em> algorithm <a href="https://arxiv.org/abs/2010.12603" title="Ryan McKenna, Daniel Sheldon. Permute-and-Flip: A new mechanism for differentially private selection. NeurIPS 2020."><strong>[MS20]</strong></a> <a href="https://arxiv.org/abs/2105.07260" title="Zeyu Ding, Daniel Kifer, Sayed M. Saghaian N. E., Thomas Steinke, Yuxin Wang, Yingtai Xiao, Danfeng Zhang. The Permute-and-Flip Mechanism is Identical to Report-Noisy-Max with Exponential Noise. 2021."><strong>[DKSSWXZ21]</strong></a>. However, these algorithms don’t enjoy the same improved bounded range and concentrated differential privacy guarantees as the exponential mechanism.</p>
<p>There are also other variants of the selection problem. For example, in some cases we can assume that only a few options have low loss and the rest of the options have high loss – i.e., there is a gap between the minimum loss and the second-lowest loss (or, more generally, the \(k\)-th lowest loss). In this case there are algorithms that attain better accuracy than the exponential mechanism under relaxed privacy definitions <a href="https://arxiv.org/abs/1409.2177" title="Kamalika Chaudhuri, Daniel Hsu, Shuang Song. The Large Margin Mechanism for Differentially Private Maximization. NIPS 2014."><strong>[CHS14]</strong></a> <a href="https://dl.acm.org/doi/10.1145/3188745.3188946" title=" Mark Bun, Cynthia Dwork, Guy N. Rothblum, Thomas Steinke. Composable and versatile privacy via truncated CDP. STOC 2018."><strong>[BDRS18]</strong></a> <a href="https://arxiv.org/abs/1905.13229" title="Mark Bun, Gautam Kamath, Thomas Steinke, Zhiwei Steven Wu. Private Hypothesis Selection. NeurIPS 2019."><strong>[BKSW19]</strong></a>.</p>
<p>There are a lot of interesting aspects of private selection, including questions for further research! We hope to have further posts about some of these topics.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>For simplicity, we restrict our discussion here to finite sets of outputs, although the definitions, algorithms, and results can be extended to infinite sets. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>To be more precise, \((\varepsilon,\delta)\)-differential privacy is equivalent to demanding that \(\mathbb{E}[\max\{0,1-\exp(\varepsilon-Z)\}]\le\delta\) <a href="https://arxiv.org/abs/2004.00010" title="Clément L. Canonne, Gautam Kamath, Thomas Steinke. The Discrete Gaussian for Differential Privacy. NeurIPS 2020."><strong>[CKS20]</strong></a>. (To be completely precise, we must appropriately deal with the \(Z=\infty\) case, which we ignore in this discussion for simplicity.) <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>This proof actually gives <a href="https://dongjs.github.io/2020/02/10/ExpMech.html">a slightly stronger result</a>: We can replace the sensitivity \(\Delta\) (defined in Equation 2) by half the range \[\hat\Delta = \frac12 \sup_{x,x’ \in \mathcal{X}^n : d(x,x’) \le 1} \left( \max_{\overline{y}\in\mathcal{Y}} \left( \ell(\overline{y},x) - \ell(\overline{y},x’) \right) - \min_{\underline{y}\in\mathcal{Y}} \left( \ell(\underline{y},x) - \ell(\underline{y},x’) \right) \right).\] We always have \(\hat\Delta \le \Delta\), but it is possible that \(\hat\Delta < \Delta\), and the privacy analysis of the exponential mechanism still works if we replace \(\Delta\) by \(\hat\Delta\). <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Equivalently, a randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\rho\)-concentrated differential privacy if, for all pairs of inputs \(x, x’ \in \mathcal{X}^n\) differing only on the data of a single individual, \[\forall \lambda > 0 ~~~~~ \mathrm{D}_{\lambda+1}(M(x)\|M(x’)) \le (\lambda+1)\rho,\] where \(\mathrm{D}_{\lambda+1}(M(x)\|M(x’)) = \frac{1}{\lambda} \log \mathbb{E}[\exp(\lambda Z)]\) is the order \(\lambda+1\) Rényi divergence of \(M(x)\) from \(M(x’)\). <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>To be precise, if \(M(x) = q(x) + \mathcal{N}(0,\sigma^2I)\), then \(M : \mathcal{X}^n \to \mathbb{R}^d\) satisfies \(\frac{\Delta_2^2}{2\sigma^2}\)-concentrated differential privacy, where \(\Delta_2 = \sup_{x,x’\in\mathcal{X}^n : d(x,x’)\le1} \|q(x)-q(x’)\|_2\) is the 2-norm sensitivity of \(q:\mathcal{X}^n \to \mathbb{R}^d\). Furthermore, the privacy loss of the Gaussian mechanism is itself a Gaussian and it makes the inequality defining concentrated differential privacy (Equation 3) an equality for all \(\lambda\). <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Note that the expectation of the privacy loss is simply the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a>: \(\mathbb{E}[Z] = \mathrm{D}_1( M(x) \| M(x’) )\). <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>We have presented selection here in terms of minimization, but most of the literature is in terms of maximization. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Ryan RogersThomas SteinkeMon, 12 Jul 2021 10:00:00 -0700
https://differentialprivacy.org/exponential-mechanism-bounded-range/
https://differentialprivacy.org/exponential-mechanism-bounded-range/Open Problem - Optimal Query Release for Pure Differential Privacy<p>Releasing large sets of statistical queries is a centerpiece of the theory of differential privacy. Here, we are given a <em>dataset</em> \(x = (x_1,\dots,x_n) \in [T]^n\), and a set of <em>statistical queries</em> \(f_1,\dots,f_k\), where each query is defined by some bounded function \(f_j : [T] \to [-1,1]\), and (abusing notation) is defined as
\[
f_j(x) = \frac{1}{n} \sum_{i=1}^{n} f_j(x_i).
\]
We use \(f(x) = (f_1(x),\dots,f_k(x))\) to denote the vector consisting of the true answers to all these queries.
Our goal is to design an \((\varepsilon, \delta)\)-differentially private algorithm \(M\) that takes a dataset \(x\in [T]^n\) and outputs a random vector \(M(x)\in \mathbb{R}^k\) such that \(\| M(x) - f(x) \|\) is small in expectation for some norm \(\|\cdot\|\). Usually algorithms for this problem also give high probability bounds on the error, but we focus on expected error for simplicity.</p>
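<p>To make the setup concrete, here is a minimal Python sketch that evaluates a small set of statistical queries on a toy dataset and measures the error of a candidate answer vector in two norms. The dataset, the domain size \(T = 4\), the two queries, and the candidate vector are all made up for illustration, and the candidate is not the output of any differentially private algorithm.</p>

```python
import math

def answer_queries(dataset, queries):
    """Return f(x) = (f_1(x), ..., f_k(x)), where f_j(x) averages f_j over the dataset."""
    n = len(dataset)
    return [sum(f(xi) for xi in dataset) / n for f in queries]

# Toy dataset x in [T]^n with T = 4 and n = 8 (domain elements are 1..T).
x = [1, 3, 3, 4, 2, 1, 4, 4]

# Two bounded queries f_j : [T] -> [-1, 1].
queries = [
    lambda v: 1.0 if v >= 3 else 0.0,   # fraction of records with value >= 3
    lambda v: 1.0 if v == 1 else -1.0,  # +1 on value 1, -1 elsewhere
]

true_answers = answer_queries(x, queries)

# Error of a hypothetical output vector M(x), measured in two norms.
candidate = [0.6, -0.4]
diff = [m - a for m, a in zip(candidate, true_answers)]
linf_error = max(abs(d) for d in diff)
l2_error = math.sqrt(sum(d * d for d in diff))

print("true answers:", true_answers)
print("l_inf error:", linf_error, " l_2 error:", l2_error)
```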
<p>This problem has been studied for both <em>pure differential privacy</em> (\(\delta = 0\)) and <em>approximate differential privacy</em> (\(\delta > 0\)), and for both \(\ell_\infty\)-error
\[
\mathbb{E}( \| M(x) - f(x)\|_{\infty} ) \leq \alpha,
\]
and \(\ell_2\)-error
\[
\mathbb{E}( \| M(x) - f(x)\|_{2} ) \leq \alpha k^{1/2},
\]
giving four variants of the problem. By now we know tight worst-case upper and lower bounds for two of these variants, and nearly tight bounds (up to logarithmic factors) for a third. The tightest known upper bounds are given in the following table.</p>
<table>
<tbody>
<tr>
<td> </td>
<td>Pure DP</td>
<td>Approx DP</td>
</tr>
<tr>
<td>\( \ell_2 \)<br />error</td>
<td>\( \alpha \lesssim \left(\frac{\log^2 k ~\cdot~ \log^{3/2}T}{\varepsilon n} \right)^{1/2} \) <br /> [<a href="https://arxiv.org/abs/1212.0297">NTZ13</a>]</td>
<td>\( \alpha \lesssim \left(\frac{\log^{1/2} T}{\varepsilon n} \right)^{1/2} \) <br /> [<a href="https://guyrothblum.files.wordpress.com/2014/11/drv10.pdf">DRV10</a>]</td>
</tr>
<tr>
<td>\( \ell_\infty \)<br />error</td>
<td>\( \alpha \lesssim \left(\frac{\log k ~\cdot~ \log T}{\varepsilon n} \right)^{1/3} \) <br /> [<a href="https://arxiv.org/abs/1109.2229">BLR13</a>]</td>
<td>\( \alpha \lesssim \left(\frac{\log k ~\cdot~ \log^{1/2} T}{\varepsilon n} \right)^{1/2} \) <br /> [<a href="https://guyrothblum.files.wordpress.com/2014/11/hr10.pdf">HR10</a>, <a href="https://arxiv.org/abs/1107.3731">GRU12</a>]</td>
</tr>
</tbody>
</table>
<p>The bounds for approximate DP are known to be tight [<a href="https://arxiv.org/abs/1311.3158">BUV14</a>]. Our two open problems both involve improving the best known upper bounds for pure differential privacy.</p>
<blockquote>
<p><b>Open Problem 1:</b> What is the best possible \(\ell_\infty\)-error for answering a worst-case set of \(k\) statistical queries over a domain of size \(T\) subject to \((\varepsilon,0)\)-differential privacy?</p>
</blockquote>
<p>We conjecture that the known upper bound in the table can be improved to
\[
\alpha = \left(\frac{\log k \cdot \log T}{\varepsilon n} \right)^{1/2},
\]
which is known to be the best possible [<a href="https://dataspace.princeton.edu/handle/88435/dsp01vq27zn422">Har11</a>, Theorem 4.5.1].</p>
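<p>To get a feel for the gap between the two exponents, write \(r = \frac{\log k \cdot \log T}{\varepsilon n}\). Ignoring constant factors, the known bound is \(\alpha \lesssim r^{1/3}\) and the conjectured bound is \(\alpha \lesssim r^{1/2}\). Both are only meaningful when \(r < 1\), in which case \(r^{1/2} < r^{1/3}\); for example, \(r = 10^{-6}\) gives \(r^{1/3} = 10^{-2}\) versus \(r^{1/2} = 10^{-3}\). In terms of sample size, reaching error \(\alpha\) requires \[n \gtrsim \frac{\log k \cdot \log T}{\varepsilon \alpha^3} \quad \text{under the known bound, versus} \quad n \gtrsim \frac{\log k \cdot \log T}{\varepsilon \alpha^2} \quad \text{under the conjectured one.}\]</p>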
<blockquote>
<p><b>Open Problem 2:</b> What is the best possible \(\ell_2\)-error for answering a worst-case set of \(k\) statistical queries over a domain of size \(T\) subject to \((\varepsilon,0)\)-differential privacy?</p>
</blockquote>
<p>We conjecture that the upper bound can be improved to
\[
\alpha = \left(\frac{\log T}{\varepsilon n} \right)^{1/2}.
\]
The construction used in [<a href="https://dataspace.princeton.edu/handle/88435/dsp01vq27zn422">Har11</a>, Theorem 4.5.1] can be analyzed to show this bound would be tight. Note, in particular, that this conjecture implies that the tight upper bound has no dependence on the number of queries, similarly to the case of \(\ell_2\) error and approximate DP.</p>
Sasho NikolovJonathan UllmanWed, 07 Jul 2021 13:45:00 -0400
https://differentialprivacy.org/open-problem-optimal-query-release/
https://differentialprivacy.org/open-problem-optimal-query-release/Conference Digest - ICML 2021<p><a href="https://icml.cc/Conferences/2021">ICML 2021</a>, one of the biggest conferences in machine learning, naturally has a ton of interesting sounding papers on the topic of differential privacy.
We went through this year’s <a href="https://icml.cc/Conferences/2021/AcceptedPapersInitial">accepted papers</a> and aggregated all the relevant papers we could find.
In addition, this year features three workshops on the topic of privacy, as well as a tutorial.
As always, please inform us if we overlooked any papers on differential privacy.</p>
<h2 id="workshops">Workshops</h2>
<ul>
<li>
<p><a href="http://federated-learning.org/fl-icml-2021/">Federated Learning for User Privacy and Data Confidentiality</a></p>
</li>
<li>
<p><a href="https://sites.google.com/view/ml4data">Machine Learning for Data: Automated Creation, Privacy, Bias</a></p>
</li>
<li>
<p><a href="https://tpdp.journalprivacyconfidentiality.org/2021/">Theory and Practice of Differential Privacy</a></p>
</li>
</ul>
<h2 id="tutorial">Tutorial</h2>
<ul>
<li><a href="https://icml.cc/Conferences/2021/Schedule?showEvent=10839">Privacy in Learning: Basics and the Interplay</a><br />
<a href="https://www.microsoft.com/en-us/research/people/huzhang/">Huishuai Zhang</a>, <a href="https://www.microsoft.com/en-us/research/people/weic/">Wei Chen</a></li>
</ul>
<h2 id="papers">Papers</h2>
<ul>
<li>
<p><a href="https://arxiv.org/abs/2009.02668">A Framework for Private Matrix Analysis in Sliding Window Model</a><br />
<a href="https://sites.google.com/view/jalajupadhyay/home">Jalaj Upadhyay</a>, <a href="https://www.fujitsu.com/us/about/businesspolicy/tech/rd/research-staff/sarvagya.html">Sarvagya Upadhyay</a></p>
</li>
<li>
<p>Accuracy, Interpretability, and Differential Privacy via Explainable Boosting<br />
<a href="https://scholar.google.com/citations?user=HmxjgMAAAAAJ">Harsha Nori</a>, <a href="https://www.microsoft.com/en-us/research/people/rcaruana/">Rich Caruana</a>, <a href="https://sites.google.com/view/zhiqi-bu">Zhiqi Bu</a>, <a href="https://heyyjudes.github.io/">Judy Hanwen Shen</a>, <a href="https://www.microsoft.com/en-us/research/people/jakul/">Janardhan Kulkarni</a></p>
</li>
<li>
<p>Differentially Private Aggregation in the Shuffle Model: Almost Central Accuracy in Almost a Single Message<br />
<a href="https://sites.google.com/view/badihghazi/home">Badih Ghazi</a>, <a href="https://sites.google.com/site/ravik53/">Ravi Kumar</a>, <a href="https://pasin30055.github.io/">Pasin Manurangsi</a>, <a href="https://rasmuspagh.net/">Rasmus Pagh</a>, <a href="https://www.linkedin.com/in/amersinha/">Amer Sinha</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2011.00467">Differentially Private Bayesian Inference for Generalized Linear Models</a><br />
<a href="https://warwick.ac.uk/fac/sci/dcs/people/u1554597">Tejas Kulkarni</a>, <a href="https://users.aalto.fi/~jalkoj1/">Joonas Jälkö</a>, <a href="https://scholar.google.com/citations?user=Y_EvCPAAAAAJ">Antti Koskela</a>, <a href="https://people.aalto.fi/samuel.kaski">Samuel Kaski</a>, <a href="https://www.cs.helsinki.fi/u/ahonkela/">Antti Honkela</a></p>
</li>
<li>
<p>Differentially-Private Clustering of Easy Instances<br />
<a href="http://www.cohenwang.com/edith/">Edith Cohen</a>, <a href="http://www.cs.tau.ac.il/~haimk/">Haim Kaplan</a>, <a href="https://www.tau.ac.il/~mansour/">Yishay Mansour</a>, <a href="https://www.uri.co.il/">Uri Stemmer</a>, <a href="https://www.linkedin.com/in/eliad-tsfadia-21482b96/">Eliad Tsfadia</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.08885">Differentially Private Correlation Clustering</a><br />
<a href="https://cs-people.bu.edu/mbun/">Mark Bun</a>, <a href="https://elias.ba30.eu/">Marek Elias</a>, <a href="https://www.microsoft.com/en-us/research/people/jakul/">Janardhan Kulkarni</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2105.13287">Differentially Private Densest Subgraph Detection</a><br />
<a href="https://biocomplexity.virginia.edu/person/dung-nguyen">Dung Nguyen</a>, <a href="https://engineering.virginia.edu/faculty/anil-vullikanti">Anil Vullikanti</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.08244">Differentially Private Quantiles</a><br />
<a href="http://jgillenw.com/">Jennifer Gillenwater</a>, <a href="https://www.majos.net/">Matthew Joseph</a>, <a href="https://www.alexkulesza.com/">Alex Kulesza</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2103.06641">Differentially Private Query Release Through Adaptive Projection</a><br />
<a href="https://sergulaydore.github.io/">Sergul Aydore</a>, <a href="https://wibrown.github.io/">William Brown</a>, <a href="https://www.cis.upenn.edu/~mkearns/">Michael Kearns</a>, <a href="http://www-cs-students.stanford.edu/~kngk/">Krishnaram Kenthapadi</a>, <a href="https://www.lucamel.is/">Luca Melis</a>, <a href="https://www.cis.upenn.edu/~aaroth/">Aaron Roth</a>, <a href="https://ankitsiva.xyz/">Ankit Siva</a></p>
</li>
<li>
<p>Differentially Private Sliced Wasserstein Distance<br />
<a href="http://asi.insa-rouen.fr/enseignants/~arakoto/">Alain Rakotomamonjy</a>, <a href="https://pageperso.lif.univ-mrs.fr/~liva.ralaivola/doku.php">Liva Ralaivola</a></p>
</li>
<li>
<p>Large Scale Private Learning via Low-rank Reparametrization<br />
<a href="https://scholar.google.com/citations?user=FcRGdiwAAAAJ">Da Yu</a>, <a href="https://www.microsoft.com/en-us/research/people/huzhang/">Huishuai Zhang</a>, <a href="https://www.microsoft.com/en-us/research/people/weic/">Wei Chen</a>, Jian Yin, <a href="https://www.microsoft.com/en-us/research/people/tyliu/">Tie-Yan Liu</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.08598">Leveraging Public Data for Practical Private Query Release</a><br />
<a href="https://www.linkedin.com/in/terrance-liu-26796974/">Terrance Liu</a>, <a href="https://sites.google.com/umn.edu/giuseppe-vietri/home">Giuseppe Vietri</a>, <a href="http://www.thomas-steinke.net/">Thomas Steinke</a>, <a href="https://www.ccs.neu.edu/home/jullman/">Jonathan Ullman</a>, <a href="https://zstevenwu.com/">Steven Wu</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2104.09734">Locally Private k-Means in One Round</a><br />
Alisa Chang, <a href="https://sites.google.com/view/badihghazi/home">Badih Ghazi</a>, <a href="https://sites.google.com/site/ravik53/">Ravi Kumar</a>, <a href="https://pasin30055.github.io/">Pasin Manurangsi</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.12099">Lossless Compression of Efficient Private Local Randomizers</a><br />
<a href="http://vtaly.net/">Vitaly Feldman</a>, <a href="http://kunaltalwar.org/">Kunal Talwar</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2105.08233">Oneshot Differentially Private Top-k Selection</a><br />
<a href="https://lsa.umich.edu/stats/people/phd-students/qiaogang.html">Gang Qiao</a>, <a href="http://www-stat.wharton.upenn.edu/~suw/">Weijie Su</a>, <a href="https://research.google/people/LiZhang/">Li Zhang</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2002.12321">PAPRIKA: Private Online False Discovery Rate Control</a><br />
<a href="https://wanrongz.github.io/">Wanrong Zhang</a>, <a href="http://www.gautamkamath.com/">Gautam Kamath</a>, <a href="https://sites.gatech.edu/rachel-cummings/">Rachel Cummings</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2103.00039">Practical and Private (Deep) Learning without Sampling or Shuffling</a><br />
<a href="https://kairouzp.github.io/">Peter Kairouz</a>, <a href="https://research.google/people/author35837/">Brendan McMahan</a>, <a href="https://shs037.github.io/">Shuang Song</a>, <a href="http://www.omthakkar.com/">Om Thakkar</a>, <a href="https://athakurta.squarespace.com/">Abhradeep Thakurta</a>, <a href="https://research.google/people/106689/">Zheng Xu</a></p>
</li>
<li>
<p>Private Adaptive Gradient Methods for Convex Optimization<br />
<a href="http://web.stanford.edu/~asi/">Hilal Asi</a>, <a href="https://web.stanford.edu/~jduchi/">John Duchi</a>, <a href="https://afallah.lids.mit.edu/">Alireza Fallah</a>, <a href="https://scholar.google.com/citations?user=_JXjrEp9FhYC">Omid Javidbakht</a>, <a href="http://kunaltalwar.org/">Kunal Talwar</a></p>
</li>
<li>
<p>Private Alternating Least Squares: (Nearly) Optimal Privacy/Utility Trade-off for Matrix Completion<br />
Steve Chien, <a href="https://www.prateekjain.org/">Prateek Jain</a>, <a href="http://walid.krichene.net/">Walid Krichene</a>, <a href="https://scholar.google.com/citations?user=yR-ugIoAAAAJ">Steffen Rendle</a>, <a href="https://shs037.github.io/">Shuang Song</a>, <a href="https://athakurta.squarespace.com/">Abhradeep Thakurta</a>, <a href="https://research.google/people/LiZhang/">Li Zhang</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2103.01516">Private Stochastic Convex Optimization: Optimal Rates in L1 Geometry</a><br />
<a href="http://web.stanford.edu/~asi/">Hilal Asi</a>, <a href="http://vtaly.net/">Vitaly Feldman</a>, <a href="https://tomerkoren.github.io/">Tomer Koren</a>, <a href="http://kunaltalwar.org/">Kunal Talwar</a></p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.06387">The Distributed Discrete Gaussian Mechanism for Federated Learning with Secure Aggregation</a><br />
<a href="https://kairouzp.github.io/">Peter Kairouz</a>, <a href="https://kenziyuliu.github.io/">Ziyu Liu</a>, <a href="http://www.thomas-steinke.net/">Thomas Steinke</a></p>
</li>
</ul>
Gautam KamathMon, 07 Jun 2021 12:30:00 -0400
https://differentialprivacy.org/icml2021/
https://differentialprivacy.org/icml2021/Statistical Inference is Not a Privacy Violation<p>On April 28, 2021, the US Census Bureau <a href="https://www.census.gov/programs-surveys/decennial-census/decade/2020/planning-management/process/disclosure-avoidance/2020-das-updates.html">released</a> a new demonstration of its differentially private Disclosure Avoidance System (DAS) for the 2020 US Census. The public were given a month to submit feedback before the system is finalized.
This demonstration data and the feedback have generated a lot of discussion, including media coverage on <a href="https://www.npr.org/2021/05/19/993247101/for-the-u-s-census-keeping-your-data-anonymous-and-useful-is-a-tricky-balance">National Public Radio</a>, in <a href="https://www.washingtonpost.com/local/social-issues/2020-census-differential-privacy-ipums/2021/06/01/6c94b46e-c30d-11eb-93f5-ee9558eecf4b_story.html">the Washington Post</a>, and via <a href="https://apnews.com/article/business-census-2020-technology-e701e313e841674be6396321343b7e49">the Associated Press</a>. The DAS is also the subject of an <a href="https://www.courtlistener.com/docket/59728874/state-v-united-states-department-of-commerce/">ongoing lawsuit</a>.</p>
<p>The following is a response from experts on differential privacy and cryptography to the <a href="https://alarm-redist.github.io/posts/2021-05-28-census-das/Harvard-DAS-Evaluation.pdf">working paper of Kenny et al.</a> on the impact of the 2020 U.S. Census Disclosure Avoidance System (DAS) on redistricting.</p>
<p>This paper makes a <a href="https://github.com/frankmcsherry/blog/blob/master/posts/2016-06-14.md">common but serious mistake</a>, from which the authors wrongfully conclude the Census Bureau should not modernize its privacy-protection technology. Not only do the results not support this conclusion, but they instead show the power of the methodology, known as differential privacy, adopted by the Bureau, precisely the opposite of the authors’ erroneous conclusions.</p>
<p>Trust is essential; once destroyed it can be nearly impossible to rebuild, and getting privacy wrong in this Census will have an impact on all future government surveys. The Census Bureau has shown that their <a href="https://desfontain.es/privacy/index.html">2010 DAS does not survive modern privacy threats</a>, and in fact was roughly equivalent to publishing nearly three quarters of the responses. The Census Bureau’s decision to modernize its Disclosure Avoidance System (DAS) for the 2020 Decennial Census to be differentially private is the correct response to decades of theoretical and empirical work on the privacy risks inherent in releasing large numbers of statistics derived from a dataset.</p>
<p>The importance of the Census, and the reality that no technology competing with differential privacy exists for meeting their confidentiality obligations, makes it very important that the public and policy makers have accurate information. We imagine you will be reporting on this topic in the future. Others have <a href="https://gerrymander.princeton.edu/DAS-evaluation-Kenny-response">addressed flaws</a> in the paper regarding implications for redistricting; we want to provide you with an understanding of the privacy mistake in the study.</p>
<p>To understand the flaw in the paper’s argument, consider the role of smoking in determining cancer risk. Statistical study of medical data has taught us that smoking causes cancer. Armed with this knowledge, if we are told that 40-year-old Mr. S is a smoker, we can conclude that he has an elevated cancer risk. The statistical inference of elevated cancer risk, made before Mr. S was born, did not violate Mr. S’s privacy. To conclude otherwise is to define science to be a privacy attack. This is the mistake made in the paper.</p>
<p>A statistical inference of this kind is essentially what Kenny et al. found.</p>
<p>The authors looked at three different predictors: one built directly from (swapped) 2010 Census data and the other two built using differential privacy applied to (swapped) 2010 Census data, and evaluated all three “on approximately 5.8 million registered voters included in the North Carolina February 2021 voter file.” What did they find?</p>
<blockquote>
<p>“Our analysis shows that across three main racial and ethnic groups, the predictions based on the [differential privacy based] DAS data appear to be as accurate as those based on the 2010 Census data.”</p>
</blockquote>
<p>This makes perfect sense. Bayesian Improved Surname Geocoding, or BISG, is a statistical method of building a predictor inferring ethnicity (or race) from name and geography. Here, name and geography play the role of the information as to whether or not one smokes, and the prediction of ethnicity corresponds to the cancer risk prediction. The predictor is constructed from census data on the ethnic makeup of individual census blocks and statistical information about the popularity of individual surnames within different ethnic groups. With such a predictor, moving across the country can change the outcome, as can changing one’s name. But a BISG prediction is not about the individual, it is about the statistical—population-level—relationship between name, geography, and ethnicity.</p>
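<p>A minimal sketch of the idea behind a BISG-style predictor, assuming the usual Bayes-rule form in which the probability of a group given surname and block is proportional to P(group | surname) times P(block | group). All surnames, block names, and numbers below are hypothetical and chosen only for illustration; they are not Census or surname-list data, and this is not the code used by Kenny et al.</p>

```python
# Hypothetical inputs for illustration only; not real Census or surname data.
# P(group | surname): share of people with a given surname in each group.
P_GROUP_GIVEN_SURNAME = {
    "garcia": {"white": 0.05, "black": 0.01, "hispanic": 0.92, "other": 0.02},
    "smith": {"white": 0.71, "black": 0.23, "hispanic": 0.02, "other": 0.04},
}

# Population counts by group in two hypothetical census blocks.
BLOCK_COUNTS = {
    "block_a": {"white": 800, "black": 100, "hispanic": 50, "other": 50},
    "block_b": {"white": 100, "black": 600, "hispanic": 250, "other": 50},
}


def block_given_group(block_counts):
    """P(block | group): fraction of each group's total population in each block."""
    totals = {}
    for counts in block_counts.values():
        for group, n in counts.items():
            totals[group] = totals.get(group, 0) + n
    return {
        block: {group: n / totals[group] for group, n in counts.items()}
        for block, counts in block_counts.items()
    }


def bisg_posterior(surname, block, block_counts=BLOCK_COUNTS):
    """Bayes rule: P(group | surname, block) is proportional to
    P(group | surname) * P(block | group)."""
    prior = P_GROUP_GIVEN_SURNAME[surname]
    likelihood = block_given_group(block_counts)[block]
    unnormalized = {g: prior[g] * likelihood[g] for g in prior}
    z = sum(unnormalized.values())
    return {g: p / z for g, p in unnormalized.items()}


# Same surname, two different blocks: the prediction follows the inputs,
# not the person.
in_a = bisg_posterior("smith", "block_a")
in_b = bisg_posterior("smith", "block_b")
print(max(in_a, key=in_a.get), max(in_b, key=in_b.get))
```

<p>The function takes only a surname and a block as input. Nothing about a particular respondent enters the computation, which is why the output changes when the same person is imagined in a different block.</p>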
<p>The differentially private DAS enabled learning to make statistical inferences about ethnicity from name and geography, without compromising the privacy of any Census respondent, exactly as it was intended to do. In other words, the paper establishes fitness-for-use of the DAS data for the BISG statistical method! Because differential privacy permits learning statistical patterns without compromising the privacy of individual members of the dataset, it should not interfere with learning the predictor, which is exactly what the authors found. Returning to our “smoking causes cancer” example, the researchers found that it was just as easy to detect this statistical pattern with a modern disclosure avoidance system in place as it was with the older, less protective system.</p>
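<p>To illustrate why noise calibrated to protect individuals need not disturb a population-level predictor, here is a toy sketch. It uses the textbook Laplace mechanism on block counts as a stand-in; the actual 2020 DAS is far more elaborate and is not reproduced here. The surname prior, block names, counts, and the choice of epsilon are all hypothetical.</p>

```python
import random

# Hypothetical inputs for illustration only; not real Census or surname data.
SURNAME_PRIOR = {"white": 0.71, "black": 0.23, "hispanic": 0.02, "other": 0.04}
BLOCK_COUNTS = {
    "block_a": {"white": 800, "black": 100, "hispanic": 50, "other": 50},
    "block_b": {"white": 100, "black": 600, "hispanic": 250, "other": 50},
}


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(block_counts, epsilon, rng):
    """Add Laplace(1/epsilon) noise to each count, then clamp at zero.

    Adding or removing one person changes one count by 1, so this is the
    standard Laplace mechanism for a histogram; clamping is post-processing.
    """
    scale = 1.0 / epsilon
    return {
        block: {g: max(0.0, n + laplace_noise(scale, rng)) for g, n in counts.items()}
        for block, counts in block_counts.items()
    }


def bisg_posterior(prior, block, block_counts):
    """P(group | surname, block), proportional to P(group | surname) * P(block | group)."""
    totals = {g: sum(c[g] for c in block_counts.values()) for g in prior}
    unnorm = {g: prior[g] * block_counts[block][g] / totals[g] for g in prior}
    z = sum(unnorm.values())
    return {g: p / z for g, p in unnorm.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
exact = bisg_posterior(SURNAME_PRIOR, "block_b", BLOCK_COUNTS)
private = bisg_posterior(
    SURNAME_PRIOR, "block_b", noisy_counts(BLOCK_COUNTS, epsilon=1.0, rng=rng)
)
max_shift = max(abs(exact[g] - private[g]) for g in exact)
print(f"largest change in any posterior probability: {max_shift:.4f}")
```

<p>In this toy setting the noise on each count is on the order of one person, while the counts are in the tens or hundreds, so the learned relationship between surname, block, and group barely moves.</p>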
<p>The authors’ conclusion, that “the DAS data may not provide universal privacy protection,” is simply not supported by their findings.</p>
<p>They have confused learning that smoking causes cancer—and applying this predictor to an individual smoker—with learning medical details of individual patients in the dataset. Change the input to the predictor—replace “smoker” with “non-smoker” or move across the country, for example—and the prediction changes.</p>
<p>The BISG prediction is not about the individual; it does not accompany her as she relocates from one neighborhood to another. It is a statistical relationship between name, geography, and ethnicity. It is not a privacy compromise; it is science.</p>
<p>Signed:</p>
<ul>
<li>Mark Bun, Assistant Professor of Computer Science, Boston University</li>
<li>Damien Desfontaines, Privacy Engineer, Google</li>
<li>Cynthia Dwork, Professor of Computer Science, Harvard University</li>
<li>Moni Naor, Professor of Computer Science, The Weizmann Institute of Science</li>
<li>Kobbi Nissim, Professor of Computer Science, Georgetown University</li>
<li>Aaron Roth, Professor of Computer and Information Science, University of Pennsylvania</li>
<li>Adam Smith, Professor of Computer Science, Boston University</li>
<li>Thomas Steinke, Research Scientist, Google</li>
<li>Jonathan Ullman, Assistant Professor of Computer Science, Northeastern University</li>
<li>Salil Vadhan, Professor of Computer Science and Applied Mathematics, Harvard University</li>
</ul>
<p>Please contact Cynthia Dwork to be put in touch with authors who are happy to speak about this on the record.</p>
Jonathan Ullman
Thu, 03 Jun 2021 18:30:00 -0500
https://differentialprivacy.org/inference-is-not-a-privacy-violation/