Differential Privacy: website for the differential privacy research community
https://differentialprivacy.org
<h1>Composition Basics</h1>
<p>Our data is subject to many different uses. Many entities will have access to our data and those entities will perform many different analyses that involve our data. The greatest risk to privacy is that an attacker will combine multiple pieces of information from the same or different sources and that the combination of these will reveal sensitive details about us.
Thus we cannot study privacy leakage in a vacuum; it is important that we can reason about the accumulated privacy leakage over multiple independent analyses, which is known as <em>composition</em>. We have <a href="/privacy-composition/">previously discussed</a> why composition is so important for differential privacy.</p>
<p>This is the first in a series of posts on <em>composition</em> in which we will explain in more detail how composition analyses work.</p>
<p>Composition is quantitative. The differential privacy guarantee of the overall system will depend on the number of analyses and the privacy parameters that they each satisfy. The exact relationship between these quantities can be complex. There are various composition theorems that give bounds on the overall parameters in terms of the parameters of the parts of the system.</p>
<p>The simplest composition theorem is what is known as basic composition, which applies to pure \(\varepsilon\)-DP (although it can be extended to approximate \((\varepsilon,\delta)\)-DP):</p>
<blockquote>
<p><strong>Theorem</strong> (Basic Composition)
Let \(M_1, M_2, \cdots, M_k : \mathcal{X}^n \to \mathcal{Y}\) be randomized algorithms. Suppose \(M_j\) is \(\varepsilon_j\)-DP for each \(j \in [k]\).
Define \(M : \mathcal{X}^n \to \mathcal{Y}^k\) by \(M(x)=(M_1(x),M_2(x),\cdots,M_k(x))\), where each algorithm is run independently. Then \(M\) is \(\varepsilon\)-DP for \(\varepsilon = \sum_{j=1}^k \varepsilon_j\).</p>
</blockquote>
<p><em>Proof.</em>
Fix an arbitrary pair of neighbouring datasets \(x,x' \in \mathcal{X}^n\) and output \(y \in \mathcal{Y}^k\).
To establish that \(M\) is \(\varepsilon\)-DP, we must show that \(e^{-\varepsilon} \le \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x')=y]} \le e^\varepsilon\). By independence, we have \[\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x')=y]} = \frac{\prod_{j=1}^k\mathbb{P}[M_j(x)=y_j]}{\prod_{j=1}^k\mathbb{P}[M_j(x')=y_j]} = \prod_{j=1}^k \frac{\mathbb{P}[M_j(x)=y_j]}{\mathbb{P}[M_j(x')=y_j]} \le \prod_{j=1}^k e^{\varepsilon_j} = e^{\sum_{j=1}^k \varepsilon_j} = e^\varepsilon,\] where the inequality follows from the fact that each \(M_j\) is \(\varepsilon_j\)-DP and, hence, \(e^{-\varepsilon_j} \le \frac{\mathbb{P}[M_j(x)=y_j]}{\mathbb{P}[M_j(x')=y_j]} \le e^{\varepsilon_j}\). Similarly, \(\prod_{j=1}^k \frac{\mathbb{P}[M_j(x)=y_j]}{\mathbb{P}[M_j(x')=y_j]} \ge \prod_{j=1}^k e^{-\varepsilon_j}\), which completes the proof. ∎</p>
<p>Basic composition is already a powerful result, despite its simple proof; it establishes the versatility of differential privacy and allows us to begin reasoning about complex systems in terms of their building blocks. For example, suppose we have \(k\) functions \(f_1, \cdots, f_k : \mathcal{X}^n \to \mathbb{R}\) each of sensitivity \(1\). For each \(j \in [k]\), we know that adding \(\mathsf{Laplace}(1/\varepsilon)\) noise to the value of \(f_j(x)\) satisfies \(\varepsilon\)-DP. Thus, if we add independent \(\mathsf{Laplace}(1/\varepsilon)\) noise to each value \(f_j(x)\) for all \(j \in [k]\), then basic composition tells us that releasing this vector of \(k\) noisy values satisfies \(k\varepsilon\)-DP. If we want the overall system to be \(\varepsilon\)-DP, then we should add independent \(\mathsf{Laplace}(k/\varepsilon)\) noise to each value \(f_j(x)\).</p>
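<p>To make the recipe above concrete, here is a minimal Python sketch (our own illustration; the function names and the choice of three example query answers are assumptions, not from the post) of answering \(k\) sensitivity-1 queries under a total \(\varepsilon\)-DP budget via basic composition:</p>

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(scale) noise via inverse-CDF of a uniform draw."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def release_queries(values, total_epsilon, rng):
    """Release k sensitivity-1 answers so the whole vector is total_epsilon-DP.

    Each coordinate gets Laplace(k / total_epsilon) noise, i.e. a per-query
    budget of total_epsilon / k; basic composition sums these budgets to
    total_epsilon for the full release.
    """
    k = len(values)
    scale = k / total_epsilon  # noise scale grows linearly with k
    return [v + laplace_noise(scale, rng) for v in values]

rng = random.Random(0)
noisy = release_queries([10.0, 20.0, 30.0], total_epsilon=1.0, rng=rng)
```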
<h2 id="is-basic-composition-optimal">Is Basic Composition Optimal?</h2>
<p>If we want to release \(k\) values each of sensitivity \(1\) (as above) and have the overall release be \(\varepsilon\)-DP, then, using basic composition, we can add \(\mathsf{Laplace}(k/\varepsilon)\) noise to each value. The variance of the noise for each value is \(2k^2/\varepsilon^2\), so the standard deviation is \(\sqrt{2} k /\varepsilon\). In other words, the scale of the noise must grow linearly with the number of values \(k\) if the overall privacy and each value’s sensitivity is fixed. It is natural to wonder whether the scale of the Laplace noise can be reduced by improving the basic composition result. We now show that this is not possible.</p>
<p>For each \(j \in [k]\), let \(M_j : \mathcal{X}^n \to \mathbb{R}\) be the algorithm that releases \(f_j(x)\) with \(\mathsf{Laplace}(k/\varepsilon)\) noise added. Let \(M : \mathcal{X}^n \to \mathbb{R}^k\) be the composition of these \(k\) algorithms. Then \(M_j\) is \(\varepsilon/k\)-DP for each \(j \in [k]\) and basic composition tells us that \(M\) is \(\varepsilon\)-DP. The question is whether \(M\) satisfies a better DP guarantee than this – i.e., does \(M\) satisfy \(\varepsilon_*\)-DP for some \(\varepsilon_*<\varepsilon\)?
Suppose we have neighbouring datasets \(x,x'\in\mathcal{X}^n\) such that \(f_j(x) = f_j(x')+1\) for each \(j \in [k]\). Let \(y=(a,a,\cdots,a) \in \mathbb{R}^k\) for some \(a \ge \max_{j=1}^k f_j(x)\).
Then
\[
\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x')=y]} = \frac{\prod_{j=1}^k \mathbb{P}[f_j(x)+\mathsf{Laplace}(k/\varepsilon)=y_j]}{\prod_{j=1}^k \mathbb{P}[f_j(x')+\mathsf{Laplace}(k/\varepsilon)=y_j]}
\]
\[
= \prod_{j=1}^k \frac{\frac{\varepsilon}{2k}\exp\left(-\frac{\varepsilon}{k} |y_j-f_j(x)| \right)}{\frac{\varepsilon}{2k}\exp\left(-\frac{\varepsilon}{k} |y_j-f_j(x')| \right)}
= \prod_{j=1}^k \frac{\exp\left(-\frac{\varepsilon}{k} (y_j-f_j(x)) \right)}{\exp\left(-\frac{\varepsilon}{k} (y_j-f_j(x')) \right)}
\]
\[
= \prod_{j=1}^k \exp\left(\frac{\varepsilon}{k}\left(f_j(x)-f_j(x')\right)\right)
= \exp\left( \frac{\varepsilon}{k} \sum_{j=1}^k \left(f_j(x)-f_j(x')\right)\right)= e^\varepsilon,
\]
where the third equality removes the absolute values because \(y_j \ge f_j(x)\) and \(y_j \ge f_j(x')\).
This shows that basic composition is optimal. For this example, we cannot prove a better guarantee than what is given by basic composition.</p>
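<p>This worst-case calculation is easy to check numerically. The sketch below (our own, with hypothetical values for the \(f_j(x)\)) evaluates the Laplace log-densities under both neighbouring datasets at a point \(y\) lying above both means, and confirms that the log-likelihood ratio is exactly \(\varepsilon\):</p>

```python
import math

def laplace_logpdf(y, mean, scale):
    """Log-density at y of mean + Laplace(scale) noise."""
    return -math.log(2 * scale) - abs(y - mean) / scale

k, eps = 5, 1.0
scale = k / eps
f_x = [3.0, 1.0, 4.0, 1.0, 5.0]   # hypothetical values f_j(x)
f_xp = [v - 1.0 for v in f_x]     # neighbouring dataset: f_j(x') = f_j(x) - 1
a = max(f_x) + 1.0                # any a >= max_j f_j(x) works
y = [a] * k

# per-coordinate log-ratio is (f_j(x) - f_j(x')) / scale = eps / k,
# so the k coordinates sum to exactly eps
log_ratio = sum(
    laplace_logpdf(yj, m, scale) - laplace_logpdf(yj, mp, scale)
    for yj, m, mp in zip(y, f_x, f_xp)
)
```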
<p>Is there some other way to improve upon basic composition that circumvents this example? Note that we assumed that there are neighbouring datasets \(x,x'\in\mathcal{X}^n\) such that \(f_j(x) = f_j(x')+1\) for each \(j \in [k]\). In some settings, no such worst case datasets exist. In that case, instead of scaling the noise linearly with \(k\), we can scale the Laplace noise according to the \(\ell_1\) sensitivity \(\Delta_1 := \sup_{x,x' \in \mathcal{X}^n \atop \text{neighbouring}} \sum_{j=1}^k |f_j(x)-f_j(x')|\).</p>
<p>Instead of adding assumptions to the problem, we will look more closely at the example above.
We showed that there exists some output \(y \in \mathbb{R}^k\) such that \(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x')=y]} = e^\varepsilon\).
However, such outputs \(y\) are very rare, as we require \(y_j \ge \max\{f_j(x),f_j(x')\}\) for each \(j \in [k]\) where \(y_j = f_j(x) + \mathsf{Laplace}(k/\varepsilon)\). Thus, in order to observe an output \(y\) such that the likelihood ratio is maximal, all of the \(k\) Laplace noise samples must be positive, which happens with probability \(2^{-k}\).
The fact that outputs \(y\) with maximal likelihood ratio are exceedingly rare turns out to be a general phenomenon and not specific to the example above.</p>
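<p>A quick Monte Carlo experiment (ours; the parameters are arbitrary) illustrates this rarity: the fraction of runs in which all \(k\) independent Laplace samples come out positive concentrates around \(2^{-k}\).</p>

```python
import math
import random

def laplace_sample(scale, rng):
    """Inverse-CDF sampler for Laplace(scale)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

rng = random.Random(1)
k, trials = 4, 20000
# count runs where all k noise samples are positive (the worst-case event)
hits = sum(
    all(laplace_sample(1.0, rng) > 0 for _ in range(k))
    for _ in range(trials)
)
frac = hits / trials  # concentrates near 2**(-k) = 0.0625 for k = 4
```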
<p>Can we improve on basic composition if we only ask for a high probability bound? That is, instead of demanding \(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x')=y]} \le e^{\varepsilon_*}\) for all \(y \in \mathcal{Y}\), we demand \(\mathbb{P}_{Y \gets M(x)}\left[\frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x')=Y]} \le e^{\varepsilon_*}\right] \ge 1-\delta\) for some \(0 < \delta \ll 1\). Can we prove a better bound \(\varepsilon_* < \varepsilon\) in this relaxed setting? The answer turns out to be yes.</p>
<p>The limitation of pure \(\varepsilon\)-DP is that events with tiny probability – which are negligible in real-world applications – can dominate the privacy analysis. This motivates us to move to relaxed notions of differential privacy, such as approximate \((\varepsilon,\delta)\)-DP and concentrated DP, which are less sensitive to low probability events.</p>
<h2 id="preview-advanced-composition">Preview: Advanced Composition</h2>
<p>By moving to approximate \((\varepsilon,\delta)\)-DP with \(\delta>0\), we can prove an asymptotically better composition theorem, which is known as <em>the advanced composition theorem</em> <strong><a href="https://ieeexplore.ieee.org/document/5670947" title="Cynthia Dwork, Guy Rothblum, Salil Vadhan. Boosting and Differential Privacy. FOCS 2010.">[DRV10]</a></strong>.</p>
<blockquote>
<p><strong>Theorem</strong> (Advanced Composition Starting from Pure DP<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>)
Let \(M_1, M_2, \cdots, M_k : \mathcal{X}^n \to \mathcal{Y}\) be randomized algorithms. Suppose \(M_j\) is \(\varepsilon_j\)-DP for each \(j \in [k]\).
Define \(M : \mathcal{X}^n \to \mathcal{Y}^k\) by \(M(x)=(M_1(x),M_2(x),\cdots,M_k(x))\), where each algorithm is run independently. Then \(M\) is \((\varepsilon,\delta)\)-DP for any \(\delta>0\) with \[\varepsilon = \frac12 \sum_{j=1}^k \varepsilon_j^2 + \sqrt{2\log(1/\delta) \sum_{j=1}^k \varepsilon_j^2}.\]</p>
</blockquote>
<p>Recall that basic composition gives \(\delta=0\) and \(\varepsilon = \sum_{j=1}^k \varepsilon_j\). That is, basic composition scales with the 1-norm of the vector \((\varepsilon_1, \varepsilon_2, \cdots, \varepsilon_k)\), whereas advanced composition scales with the 2-norm of this vector (and the squared 2-norm).
Neither bound strictly dominates the other. However, asymptotically (in a sense we will make precise in the next paragraph) advanced composition dominates basic composition.</p>
<p>Suppose we have a fixed \((\varepsilon,\delta)\)-DP guarantee for the entire system and we must answer \(k\) queries of sensitivity \(1\).
Using basic composition, we can answer each query by adding \(\mathsf{Laplace}(k/\varepsilon)\) noise to each answer.
However, using advanced composition, we can answer each query by adding \(\mathsf{Laplace}(\sqrt{k/2\rho})\) noise to each answer, where<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>
\[\rho = \frac{\varepsilon^2}{4\log(1/\delta)+4\varepsilon}.\]
If the privacy parameters \(\varepsilon,\delta>0\) are fixed (which implies \(\rho\) is fixed) and \(k \to \infty\), we can see that asymptotically advanced composition gives noise per query scaling as \(\Theta(\sqrt{k})\), while basic composition results in noise scaling as \(\Theta(k)\).</p>
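<p>A small numerical comparison (our own sketch, using the expression for \(\rho\) above) makes the \(\Theta(\sqrt{k})\)-versus-\(\Theta(k)\) gap concrete:</p>

```python
import math

def basic_scale(k, eps):
    """Per-query Laplace scale under basic composition: k / eps."""
    return k / eps

def advanced_scale(k, eps, delta):
    """Per-query Laplace scale Laplace(sqrt(k / (2 rho))) suggested by
    advanced composition, with rho = eps^2 / (4 log(1/delta) + 4 eps)."""
    rho = eps**2 / (4 * math.log(1 / delta) + 4 * eps)
    return math.sqrt(k / (2 * rho))

eps, delta = 1.0, 1e-6
scales = {k: (basic_scale(k, eps), advanced_scale(k, eps, delta))
          for k in (10, 100, 1000)}
# basic_scale grows linearly in k, advanced_scale only like sqrt(k),
# so for large k advanced composition wins (for small k it need not)
```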
<p> </p>
<p>In the next few posts we will explain how advanced composition works. We hope this conveys an intuitive understanding of composition and, in particular, how this \(\sqrt{k}\) asymptotic behaviour arises. If you want to read ahead, these posts are extracts from <a href="https://arxiv.org/abs/2210.00597">this book chapter</a>.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>This result generalizes to approximate DP. If instead we assume \(M_j\) is \((\varepsilon_j,\delta_j)\)-DP for each \(j \in [k]\), then the final composition is \((\varepsilon,\delta+\sum_{j=1}^k \delta_j)\)-DP with \(\varepsilon\) as before. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Adding \(\mathsf{Laplace}(\sqrt{k/2\rho})\) noise to a sensitivity-1 query ensures \(\varepsilon_j\)-DP for \(\varepsilon_j = \sqrt{2\rho/k}\). Hence \(\sum_{j=1}^k \varepsilon_j^2 = 2\rho\). Setting \(\rho = \frac{\varepsilon^2}{4\log(1/\delta)+4\varepsilon}\) ensures that \(\frac12 \sum_{j=1}^k \varepsilon_j^2 + \sqrt{2\log(1/\delta) \sum_{j=1}^k \varepsilon_j^2} = \rho + \sqrt{4\rho\log(1/\delta)} \le \varepsilon\). <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Thomas Steinke, Tue, 01 Nov 2022 11:45:00 -0400
https://differentialprivacy.org/composition-basics/
<h1>Privacy Doona: Why We Should Hide Among The Clones</h1>
<p>In this blog post, we will discuss a recent(ish) result of Feldman, McMillan, and Talwar <a href="https://arxiv.org/abs/2012.12803" title="Vitaly Feldman, Audra McMillan, Kunal Talwar. Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling. FOCS 2021"><strong>[FMT21]</strong></a>, which provides an improved and simple analysis of the so-called “amplification by shuffling,” formally connecting local privacy (LDP) and shuffle privacy.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> Now, I’ll assume the reader is familiar with both LDP and Shuffle DP: if not, a quick-and-dirty refresher (with less quick, and less dirty references) can be found <a href="/trustmodels">here</a>, and of course there is also Albert Cheu’s excellent survey on Shuffle DP <a href="https://arxiv.org/abs/2107.11839" title="Albert Cheu. Differential Privacy in the Shuffle Model: A Survey of Separations. arXiv 2021"><strong>[Cheu21]</strong></a>.</p>
<p>I will also ignore most of the historical details, but it is worth mentioning that <a href="https://arxiv.org/abs/2012.12803" title="Vitaly Feldman, Audra McMillan, Kunal Talwar. Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling. FOCS 2021"><strong>[FMT21]</strong></a> is not the first paper on this “amplification by shuffling,” (which, for local reasons, I’ll just call a <em>privacy <a href="https://www.collinsdictionary.com/dictionary/english/doona">doona</a></em>) but rather is the culmination of a rather long line of work involving many cool ideas and papers, starting with <a href="https://arxiv.org/abs/1808.01394" title="Albert Cheu, Adam D. Smith, Jonathan Ullman, David Zeber, Maxim Zhilyaev. Distributed Differential Privacy via Shuffling. EUROCRYPT 2019"><strong>[CSUZZ19</strong></a>, <a href="https://arxiv.org/abs/1811.12469" title="Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, Abhradeep Thakurta. Amplification by Shuffling: From Local to Central Differential Privacy via Anonymity. SODA 2019"><strong>EFMRTT19]</strong></a>: I’d refer the reader to <strong>Table 1</strong> in <a href="https://arxiv.org/abs/2012.12803" title="Vitaly Feldman, Audra McMillan, Kunal Talwar. Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling. FOCS 2021"><strong>[FMT21]</strong></a> for an overview.</p>
<p>Alright, now that the caveats are behind us, what <em>is</em> “amplification by shuffling”? In a nutshell, it is capturing the (false!) intuition that “anonymization provides privacy” (which, again, is false! Don’t do this!) and making it… less false. The idea is that while <em>anonymization does not provide in itself any meaningful privacy guarantee</em>, it can <em>amplify existing, rigorous privacy guarantees</em>. So if I start with a somewhat lousy LDP guarantee, but then all the messages sent by all users are completely anonymized, then my lousy LDP guarantee suddenly gets <em>much</em> stronger (roughly speaking, the \(\varepsilon\) parameter goes down with the square root of the number of users involved). Which is wonderful! Let’s see what this means, quantitatively.</p>
<h3 id="the-result-of-feldman-mcmillan-and-talwar">The result of Feldman, McMillan, and Talwar.</h3>
<p>Here, we will focus on the simpler case of <em>noninteractive</em> protocols (one-shot messages from the users to the central server, no funny business with messages going back and forth); which is conceptually simpler to state and parse, still very rich and interesting, and, well, very relevant in practice (being the easiest and cheapest to deploy). If you want the results in their full glorious generality, though, they are in the paper.</p>
<p>The main theorem of <a href="https://arxiv.org/abs/2012.12803" title="Vitaly Feldman, Audra McMillan, Kunal Talwar. Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling. FOCS 2021"><strong>[FMT21]</strong></a>, specialized to this noninteractive setting, can be stated as follows: if I have an <em>\(\varepsilon_L\)-locally private</em> (LDP) protocol for a task, where all \(n\) users pass their data through the same randomizer (algorithm) \(R\) and send the resulting message \(y_i \gets R(x_i)\), then just permuting the messages \(y_1\dots,y_n\) immediately gives an \((\varepsilon,\delta)\)-<em>shuffle</em> private protocol for the same task, for any pair \((\varepsilon,\delta)\) which satisfies
<a name="eq:eps:epsL"></a>
\begin{equation}
\varepsilon \leq \log\left( 1+ 16\frac{e^{\varepsilon_{L}}-1}{e^{\varepsilon_{L}}+1}\sqrt{\frac{e^{\varepsilon_{L}}\log\frac{4}{\delta}}{n}}\right) \tag{1}
\end{equation}
as long as \(n \gg e^{\varepsilon_{L}}\log(1/\delta)\). That is quite a lot to parse, though: what does this actually <em>mean</em>?</p>
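<p>For readers who want to plug in numbers, here is the bound <a href="#eq:eps:epsL">(1)</a> transcribed as a small function (ours; it is only meaningful in the stated regime \(n \gg e^{\varepsilon_L}\log(1/\delta)\)):</p>

```python
import math

def shuffle_epsilon(eps_local, n, delta):
    """Shuffle-DP epsilon from the [FMT21] bound (1): n users, each running
    the same eps_local-LDP randomizer, messages uniformly shuffled.
    Only valid when n >> exp(eps_local) * log(1/delta)."""
    e = math.exp(eps_local)
    return math.log(
        1 + 16 * (e - 1) / (e + 1) * math.sqrt(e * math.log(4 / delta) / n)
    )

# a "lousy" local guarantee amplifies substantially with a million users
eps = shuffle_epsilon(4.0, 10**6, 1e-8)
```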
<p><strong>First</strong>, the assumption that all users have the same randomizer (or at least cannot be distinguished by their randomizer) is quite natural: if they didn’t, then we wouldn’t be able to say anything in general, since the randomizer they use could just give away their identity completely. For instance, as an extreme case, the randomizer of user \(i\) could just append \(i\) to the message (it’s OK, still LDP!), and then shuffling achieves exactly nothing: we know who sent what. So OK, asking for all randomizers to be the same is not really a restriction.</p>
<p><strong>Second</strong>, each user only sends one message, and this preserves its length (we just shuffled the messages, didn’t modify them!). So if you start with an LDP protocol with amazing features XYZ (e.g., the messages are \(1\)-bit long, or users don’t share a random seed, or the randomizers run in time \(O(1)\)), then the shuffle protocol enjoys exactly the same properties. (It also naturally enjoys some <em>robustness</em>, in the sense that if \(10\%\) of the \(n\) users maliciously deviate from the protocol, they can’t really jeopardize the privacy of the remaining \(90\%\) of users.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> Which is… good.)</p>
<p><strong>Third</strong>, this is inherently approximate DP. Here we started with pure LDP (you can also extend that to approximate LDP) and ended up with approximate Shuffle DP: this is not a mistake, that’s how it is. I am not a purist (erm) myself, and that looks more than good enough to me; but if you seek pure Shuffle DP, then this result is not the droid you’re looking for.</p>
<p align="center">
<img src="../images/droids-looking.png" width="50%" alt="This is not the pure DP guarantee you are looking for." />
</p>
<p><br /></p>
<p>Alright, <em>what</em> is this guarantee stated in <a href="#eq:eps:epsL">(1)</a> giving us? Let’s interpret the expression in <a href="#eq:eps:epsL">(1)</a> in two parameter regimes, focusing on \(\varepsilon\) (fixing some small \(\delta>0\)). If we start with \(\varepsilon_{L} \ll 1\) for our LDP randomizers \(R\), then a first-order Taylor expansion shows that we get
\[
\varepsilon \approx \varepsilon_{L}\cdot 8\sqrt{\frac{\log\frac{4}{\delta}}{n}}
\]
so that <em>shuffling improved our privacy parameter by a factor \(\sqrt{n}\)</em>.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> 😲 This is great! With more users, comes more privacy!</p>
<p>But that was starting with small \(\varepsilon_{L}\), that is, already pretty good privacy guarantees for our LDP “building block” \(R\). What happens if we start with “somewhat lousy” privacy guarantees, that is, \(\varepsilon_{L} \gg 1\)? Do we get anything interesting then?
Another Taylor expansion (everything is a Taylor expansion) shows us that, then,
<a name="eq:epsL:ll:one"></a>
\begin{equation}
\varepsilon \approx \log\left( 1+ 8\sqrt{\frac{e^{\varepsilon_{L}}\log\frac{4}{\delta}}{n}}\right) \tag{2}
\end{equation}
or, put differently,
<a name="eq:epsL:gg:one"></a>
\begin{equation}
\varepsilon \approx 8e^{\varepsilon_{L}/2}\sqrt{\frac{\log\frac{4}{\delta}}{n}} \tag{3}
\end{equation}
That’s a bit harder to interpret, but that seems… useful? It is: let us see how much, with a couple examples.</p>
<h4 id="learning">Learning.</h4>
<p>The first one is distribution learning, a.k.a. density estimation: you have \(n\) i.i.d. samples (one per user) from an unknown probability distribution \(\mathbf{p}\) over a discrete domain of size \(k\), and your goal is to output an estimate \(\widehat{\mathbf{p}}\) such that, with high (say, constant) probability, \(\mathbf{p}\) and \(\widehat{\mathbf{p}}\) are close in <em>total variation distance</em>:
\[
\operatorname{TV}(\mathbf{p},\widehat{\mathbf{p}}) = \sup_{S\subseteq [k]} (\mathbf{p}(S) - \widehat{\mathbf{p}}(S) ) \leq \alpha
\]
(if total variation distance seems a bit mysterious, it’s exactly half the \(\ell_1\) distance between the probability mass functions). We know how to solve this problem in the non-private setting: \(n=\Theta\left( \frac{k}{\alpha^2} \right)\) samples are necessary and sufficient. We know how to solve this problem in the (central) DP setting: \(n=\Theta\left( \frac{k}{\alpha^2} + \frac{k}{\alpha\varepsilon} \right)\) samples are necessary and sufficient <a href="https://proceedings.neurips.cc/paper/2015/hash/2b3bf3eee2475e03885a110e9acaab61-Abstract.html" title="Ilias Diakonikolas, Moritz Hardt, Ludwig Schmidt. Differentially Private Learning of Structured Discrete Distributions. NeurIPS 2015"><strong>[DHS15]</strong></a>. We know how to solve this problem in the LDP setting:
<a name="eq:learning:ldp"></a>
\begin{equation}
n=\Theta\left(\frac{k^2}{\alpha^2(e^\varepsilon-1)^2}+\frac{k^2}{\alpha^2e^\varepsilon}+\frac{k}{\alpha^2}\right) \tag{4}
\end{equation}
samples are necessary and sufficient <a href="http://proceedings.mlr.press/v89/acharya19a.html" title="Jayadev Acharya, Ziteng Sun, Huanyu Zhang. Hadamard Response: Estimating Distributions Privately, Efficiently, and with Little Communication. AISTATS 2019"><strong>[ASZ19]</strong></a> (note that the first term is just \(k^2/(\alpha^2\varepsilon^2)\) for small \(\varepsilon\)). Now, as they say in Mulan: <em>let’s make a shuffle DP algo out of you.</em></p>
<p>If we want to achieve \((\varepsilon,\delta)\)-shuffle DP, we need to select \(\varepsilon_L\). Based on <a href="#eq:epsL:ll:one">(2)</a> and <a href="#eq:epsL:gg:one">(3)</a>, and ignoring pesky constants, we will choose it so that
<a name="eq:choice:epsL"></a>
\begin{equation}
\varepsilon_{L} \approx \varepsilon \sqrt{\frac{n}{\log(1/\delta)}} \quad\text{ or }\quad e^{\varepsilon_{L}} \approx \varepsilon^2 \cdot \frac{n}{\log(1/\delta)}\,. \tag{5}
\end{equation}
depending on whether \(\frac{\varepsilon^2 n}{\log(1/\delta)}\geq 1\). Plugging that back in <a href="#eq:learning:ldp">(4)</a>, we see that the first case corresponds to the first term (small \(\varepsilon_{L}\)) and the second to the second term (\(\varepsilon_{L} \geq 1\)), and overall the condition on \(n\) for the original LDP algorithm to successfully learn the distribution becomes
\[
n \gtrsim
\frac{k^2}{\alpha^2(e^{\varepsilon_{L}}-1)^2}+\frac{k^2}{\alpha^2e^{\varepsilon_{L}}}+\frac{k}{\alpha^2}
\approx \frac{k^2\log(1/\delta)}{\alpha^2\varepsilon^2 n}+\frac{k^2\log(1/\delta)}{\alpha^2\varepsilon^2 n}+\frac{k}{\alpha^2}
\approx \frac{k^2\log(1/\delta)}{\alpha^2\varepsilon^2 n}+\frac{k}{\alpha^2}
\]
(where \(\gtrsim\) means “let’s ignore constants”). There is an \(n\) in the RHS as well, so reorganizing and handling the two terms separately, the condition on \(n\) becomes
\[
n \gtrsim \frac{k \sqrt{\log(1/\delta)}}{\alpha\varepsilon}+\frac{k}{\alpha^2}
\]
which… is great? We immediately get a sample complexity \(O\left(\frac{k}{\alpha^2}+\frac{k \sqrt{\log(1/\delta)}}{\alpha\varepsilon}\right)\) in the shuffle DP model, which (ignoring the \(\sqrt{\log(1/\delta)}\)) matches the one in the <em>central</em> DP setting!</p>
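<p>The bookkeeping above can be automated: the self-referential condition on \(n\) is just a quadratic, which we can solve directly. A sketch (our own illustration, with all constants deliberately ignored):</p>

```python
import math

def shuffle_learning_n(k, alpha, eps, delta):
    """Up-to-constants sample complexity for learning a distribution over [k]
    to TV-distance alpha under (eps, delta)-shuffle DP, per the derivation
    above: smallest n with n >= A / n + B, where
    A = k^2 log(1/delta) / (alpha^2 eps^2) and B = k / alpha^2."""
    A = k**2 * math.log(1 / delta) / (alpha**2 * eps**2)
    B = k / alpha**2
    # n^2 - B n - A >= 0, so take the positive root of the quadratic
    return (B + math.sqrt(B**2 + 4 * A)) / 2

n = shuffle_learning_n(k=1000, alpha=0.1, eps=1.0, delta=1e-6)
# of the same order as k * sqrt(log(1/delta)) / (alpha * eps) + k / alpha^2
```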
<p><strong>tl;dr:</strong> Taking an optimal LDP algorithm and just shuffling the messages <em>immediately</em> gives an optimal shuffle DP algorithm, no extra work needed.</p>
<h4 id="uniformity-testing">(Uniformity) Testing.</h4>
<p>Alright, maybe it was a fluke? Let’s look at another “basic” problem close to my heart: we don’t want to learn the probability distribution \(\mathbf{p}\), just test whether it is actually <em>the</em> uniform distribution<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> \(\mathbf{u}\) on the domain \([k]=\{1,2,\dots,k\}\). So if \(\mathbf{p} =\mathbf{u}\), you’ve got to say “yes” with probability at least \(2/3\), and if \(\operatorname{TV}(\mathbf{p},\mathbf{u})>\alpha\), then you need to say “no” with probability at least \(2/3\).</p>
<p>This is also well understood in the non-private setting (\(n=\Theta(\sqrt{k}/\alpha^2)\)) <a href="https://ieeexplore.ieee.org/document/4626074" title="Liam Paninski. A Coincidence-Based Test for Uniformity Given Very Sparsely Sampled Discrete Data. IEEE Trans. Inf. Theory 2008"><strong>[Paninski08]</strong></a> [see also <a href="https://ccanonne.github.io/survey-topics-dt.html">my upcoming survey</a>], in the central DP setting (\(n=\Theta\left( \frac{\sqrt{k}}{\alpha^2} + \frac{\sqrt{k}}{\alpha\sqrt{\varepsilon}}+\frac{k^{1/3}}{\alpha^{4/3}\varepsilon^{2/3}} + \frac{1}{\alpha\varepsilon} \right)\)) <a href="https://arxiv.org/abs/1707.05128" title="Jayadev Acharya, Ziteng Sun, Huanyu Zhang. Differentially Private Testing of Identity and Closeness of Discrete Distributions. NeurIPS 2018"><strong>[ASZ18</strong></a>, <a href="https://arxiv.org/abs/1707.05497" title="Maryam Aliakbarpour, Ilias Diakonikolas, Ronitt Rubinfeld. Differentially Private Identity and Equivalence Testing of Discrete Distributions. ICML 2018"><strong>ADR18]</strong></a>, and in the LDP setting, where the result differs depending on whether the users can communicate or share a common random seed
<a name="eq:testing:ldp:publiccoin"></a>
\begin{equation}
n=\Theta\left( \frac{k}{\alpha^2(e^\varepsilon-1)^2} + \frac{k}{\alpha^2e^{\varepsilon/2}} + \frac{\sqrt{k}}{\alpha^2}\right) \tag{6}
\end{equation} or not
<a name="eq:testing:ldp:privatecoin"></a>
\begin{equation}
n=\Theta\left( \frac{k^{3/2}}{\alpha^2(e^\varepsilon-1)^2} + \frac{k^{3/2}}{\alpha^2e^{\varepsilon}} + \frac{\sqrt{k}}{\alpha^2}\right) \tag{7}
\end{equation}
as established in a sequence of papers <a href="https://arxiv.org/abs/1812.11476" title="Inference under Information Constraints: Lower Bounds from Chi-Square Contraction. COLT 2019"><strong>[ACT19</strong></a>, <a href="https://arxiv.org/abs/1911.01452" title="Kareem Amin, Matthew Joseph, Jieming Mao. Pan-Private Uniformity Testing. COLT 2020"><strong>AJM20</strong></a>, <a href="https://arxiv.org/abs/2101.07981" title="Jayadev Acharya, Clément L. Canonne, Cody Freitag, Ziteng Sun, Himanshu Tyagi.
Inference Under Information Constraints III: Local Privacy Constraints. IEEE J. Sel. Areas Inf. Theory 2021"><strong>ACFST21</strong></a>, <a href="https://arxiv.org/abs/2007.10976" title="Jayadev Acharya, Clément L. Canonne, Yuhan Liu, Ziteng Sun, Himanshu Tyagi. Interactive Inference Under Information Constraints. IEEE Trans. Inf. Theory 2022"><strong>ACLST22</strong></a>, <a href="https://arxiv.org/abs/2108.08987" title="Clément L. Canonne, Hongyi Lyu. Uniformity Testing in the Shuffle Model: Simpler, Better, Faster. SOSA 2022"><strong>CL22]</strong></a>.</p>
<p>Now, say you want an \((\varepsilon,\delta)\)-shuffle DP algorithm for uniformity testing, but don’t want to design one from scratch (though it <em>is</em> possible to do so, and some did <a href="https://arxiv.org/abs/2004.09481" title="Victor Balcer, Albert Cheu, Matthew Joseph, Jieming Mao. Connecting Robust Shuffle Privacy and Pan-Privacy. SODA 2021"><strong>[BCJM21</strong></a>, <a href="https://arxiv.org/abs/2108.08987" title="Clément L. Canonne, Hongyi Lyu. Uniformity Testing in the Shuffle Model: Simpler, Better, Faster. SOSA 2022"><strong>CL22</strong></a>, <a href="https://arxiv.org/abs/2112.10032" title="Albert Cheu, Chao Yan. Pure Differential Privacy from Secure Intermediaries. arXiv 2021"><strong>CY21]</strong></a>). Let’s say you want to look at the “no-common-random-seed-shared-by-users” model (a.k.a. <em>private-coin</em> setting): so you stare at the corresponding LDP sample complexity, <a href="#eq:testing:ldp:privatecoin">(7)</a>, and try to choose \(\varepsilon_L\) to start with before shuffling. This will be the same as in the learning example (i.e., <a href="#eq:choice:epsL">(5)</a>): based on <a href="#eq:epsL:ll:one">(2)</a> and <a href="#eq:epsL:gg:one">(3)</a>, we will set
\begin{equation}
\varepsilon_{L} \approx \varepsilon \sqrt{\frac{n}{\log(1/\delta)}} \quad\text{ or }\quad e^{\varepsilon_{L}} \approx \varepsilon^2 \cdot \frac{n}{\log(1/\delta)}\,.
\end{equation}
depending on whether \(\frac{\varepsilon^2 n}{\log(1/\delta)}\geq 1\). Plugging this back in <a href="#eq:testing:ldp:privatecoin">(7)</a> and quickly checking which case corresponds to each term, we then easily get that for our algorithm to correctly solve the uniformity testing problem, it suffices that the sample complexity (number of users) \(n\) satisfies
\[
n \gtrsim \frac{k^{3/2}}{\alpha^2(e^{\varepsilon_L}-1)^2} + \frac{k^{3/2}}{\alpha^2e^{\varepsilon_L}} + \frac{\sqrt{k}}{\alpha^2}
\approx \frac{k^{3/2}\log(1/\delta)}{\alpha^2\varepsilon^2 n } + \frac{\sqrt{k}}{\alpha^2}
\]
which, reorganizing and solving for \(n\), means that it suffices to have
\[
n \gtrsim \frac{k^{3/4}\sqrt{\log(1/\delta)}}{\alpha\varepsilon} + \frac{\sqrt{k}}{\alpha^2}\,.
\]
And, <em>voilà</em>! Even better, we also have strong evidence to suspect that this sample complexity \(O\Big(\frac{k^{3/4}\sqrt{\log(1/\delta)}}{\alpha\varepsilon}+ \frac{\sqrt{k}}{\alpha^2}\Big)\) is tight among all private-coin algorithms.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></p>
<p>Now, if you wanted to look at <em>public-coin</em> shuffle DP protocols (with a common random seed available), then you would start with an optimal public-coin LDP algorithm (and look at <a href="#eq:testing:ldp:publiccoin">(6)</a>), and setting \(\varepsilon_L\) the same way you’d get a shuffle DP algorithm with sample complexity
\[
n=O\Big(\frac{k^{2/3}\log^{1/3}(1/\delta)}{\alpha^{4/3}\varepsilon^{2/3}} + \frac{\sqrt{k\log(1/\delta)}}{\alpha\varepsilon}+ \frac{\sqrt{k}}{\alpha^2}\Big)
\]
which, well, is <em>also</em> strongly believed to be optimal!</p>
<p><strong>tl;dr:</strong> Here again, taking an optimal off-the-shelf LDP algorithm and just shuffling the messages <em>immediately</em> gives an optimal shuffle DP algorithm, no extra work needed.</p>
<h3 id="conclusion">Conclusion.</h3>
<p>I hope the above convinced you of how useful this privacy amplification can be: from an optimal LDP algorithm, featuring any extra appealing characteristics you like, <em>just adding an extra shuffling step as postprocessing</em> yields an (often optimal? At least good) shuffle DP algorithm, <em>with the same characteristics</em> and built-in robustness against malicious users.</p>
<p>All you need is to make sure that your starting point, the LDP algorithm, satisfies a couple of things: (1) all users have the same randomizer,<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup> and (2) it works in all regimes of \(\varepsilon\) (both high-privacy, \(\varepsilon \leq 1\), <em>and</em> low-privacy, \(\varepsilon \gg 1\)). Once you’ve got this, Bob’s your uncle! You get shuffle DP algorithms for free.</p>
<p>It is not only appealing from a theoretical point of view, by the way! The authors of the paper worked hard to make their empirical analysis compelling as well, and their code is available <a href="https://github.com/apple/ml-shuffling-amplification">on GitHub</a> 📝. But more importantly, from a practitioner’s point of view, this means it is enough to design, implement, and test <em>one</em> algorithm (the LDP one we start with) to automatically get a trusted one in the shuffle DP model as well: this reduces the risk of bugs and security failures, as well as the amount of work spent tuning and testing…</p>
<p>So yes, whenever possible, we <em>should</em> hide among the clones!</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>The title of this post is a reference to the title of <a href="https://arxiv.org/abs/2012.12803" title="Vitaly Feldman, Audra McMillan, Kunal Talwar. Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling. FOCS 2021"><strong>[FMT21]</strong></a>, “Hiding Among The Clones,” and to the notion of <em>privacy blanket</em> introduced by Balle, Bell, Gascón, and Nissim <a href="https://link.springer.com/chapter/10.1007/978-3-030-26951-7_22" title="Borja Balle, James Bell, Adrià Gascón, Kobbi Nissim. The Privacy Blanket of the Shuffle Model. CRYPTO 2019"><strong>[BBGN19]</strong></a>. Intuitively, the “amplification by shuffling” paradigm can be seen as anonymizing the messages from local randomizers, whose message distribution can be mathematically decomposed as a mixture of “noise distribution not depending on the user’s input” and “distribution actually depending on their input.” As a result, each user randomly sends a message from the first or second distribution of the mixture. But the shuffling then hides the informative messages (drawn from the second part of the mixture) among the non-informative (noise) ones: so the noise messages end up providing a “privacy blanket” in which sensitive information is safely and soundly wrapped. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>More specifically, they can completely jeopardize the <em>utility</em> (accuracy) of the result, but in terms of privacy, all they can do is slightly reduce it: if \(10\%\) of users are malicious, the remaining \(90\%\) still get the privacy amplification guarantee of <a href="#eq:eps:epsL">(1)</a>, but with \(0.9n\) instead of \(n\). <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Of course, we started with a local privacy guarantee, and ended up with a shuffle privacy guarantee: so the two are incomparable, and one has to interpret this “amplification” in that context. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>You can replace the uniform distribution here with any known distribution \(\mathbf{q}\) of your choosing; that doesn’t change the question (or the result), but uniform is nice. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>As long as one is happy with approximate DP. One can achieve that in pure DP as well, but it’s a bit more complicated <a href="https://arxiv.org/abs/2112.10032" title="Albert Cheu, Chao Yan. Pure Differential Privacy from Secure Intermediaries. arXiv 2021"><strong>[CY21]</strong></a>. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>This is usually not a big assumption, and there are somewhat general ways to enforce it at the cost of a logarithmic factor in the number of users. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Clément CanonneTue, 24 May 2022 11:45:00 -0400
https://differentialprivacy.org/privacy-doona/
Differentially private deep learning can be effective with self-supervised models<p>Differential Privacy (DP) is a formal definition of privacy which guarantees that the outcome of a statistical procedure does not vary much regardless of whether an individual input is included or removed from the training dataset.
This guarantee is desirable when we are tasked to train machine learning models on private datasets that should not memorize individual inputs.
Past works have shown that differentially private models can be resilient to strong membership inference [<a href="https://proceedings.mlr.press/v37/kairouz15.html">1</a>, <a href="https://ieeexplore.ieee.org/abstract/document/9519424">34</a>, <a href="https://proceedings.neurips.cc/paper/2020/hash/fc4ddc15f9f4b4b06ef7844d6bb53abf-Abstract.html">35</a>] and data reconstruction attacks [<a href="https://www.usenix.org/conference/usenixsecurity19/presentation/carlini">2</a>, <a href="https://arxiv.org/abs/2201.12383">3</a>] when the privacy parameter is set to be sufficiently small.
See a <a href="https://differentialprivacy.org/how-to-deploy-ml-with-dp/">prior post</a> for more background on differentially private machine learning.</p>
<p>Yet, in practice, most attempts at training differentially private deep learning models on moderately-sized datasets have resulted in large performance drops compared to training without privacy protection baked in.
These performance drops are oftentimes large enough to discourage the adoption of differential privacy protection into machine learning pipelines altogether.</p>
<p>To provide a reference for the potential performance hit, the authors of [<a href="https://arxiv.org/abs/2102.12677">5</a>] trained a ResNet-20 from scratch on CIFAR-10 with a privacy budget of \(\epsilon=8\), obtaining a test accuracy barely over 62% (see their Table 1).
Contrast this with the 8.75% error rate (91.25% accuracy) reported for training the same architecture without enforcing differential privacy [<a href="https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html">6</a>].
While some works report private learning results better than the above, absent additional data, pre-training, or external knowledge, most improvements have been incremental, and the test accuracy for CIFAR-10 models trained under modest privacy leakage (\(\epsilon=3\)) has roughly settled to ~70% in the literature [<a href="https://arxiv.org/abs/2011.11660">4</a>].</p>
<p>One reason behind the performance drop lies in sample efficiency — differentially private learning generally requires much more data than non-private learning to reach an acceptable level of performance.
This also means that learning the high-level features (e.g., syntactic structure in text, edge detectors for images) necessary to perform specific tasks with private data can be much more sample-costly.</p>
<p>This blog post surveys results that leverage public self-supervised pre-training to obtain high-performing models through differentially private fine-tuning.
The pre-train-fine-tune paradigm is straightforward to execute and results in high-performing models under modest privacy budgets for many standard computer vision and natural language processing tasks.
Moreover, existing results have shown that private fine-tuning consistently benefits from improvements in public pre-training.</p>
<h2 id="self-supervised-pre-training">Self-Supervised Pre-Training</h2>
<p>Self-supervised learning is a paradigm which leverages unlabeled data to learn representations that can be useful for a range of downstream tasks.
Since self-supervised learning doesn’t target specific tasks itself,
the (pre-)training procedure doesn’t require labeled data — in many cases, mildly curated unlabeled data is sufficient for self-supervised pre-training to produce models for subsequent fine-tuning.
So far, there have been two broadly successful instantiations of this learning paradigm in computer vision [<a href="http://proceedings.mlr.press/v119/chen20j.html">9</a>] and natural language processing [<a href="https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf">7</a>, <a href="https://arxiv.org/abs/1810.04805">8</a>].
We recap the two approaches below.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<p><strong>Contrastive pre-training for vision:</strong>
One class of self-supervised methods in computer vision (SimCLR, [<a href="http://proceedings.mlr.press/v119/chen20j.html">9</a>]) performs pre-training through contrastive learning.
Algorithms of this type produce embeddings for images with the goal of creating different embeddings for semantically different images and similar embeddings for similar ones.
Concretely, the algorithm used in SimCLR forces models to produce similar embeddings for an image and its augmented siblings (e.g., the image rotated by some angle),
and different embeddings for separate images (and their augmentations).
The SimCLR framework with large scale models and compute led to state-of-the-art (non-private) ImageNet fine-tuning results at the time of its writing.</p>
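<p>As a rough illustration of the contrastive idea, here is a sketch of an NT-Xent-style loss (our own simplified version, not SimCLR’s reference implementation; the shapes and names are illustrative):</p>

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """Toy sketch of a SimCLR-style contrastive (NT-Xent) loss.

    z: array of shape (2N, d) where rows i and i+N are embeddings of two
    augmentations of the same image. Illustration only, not the reference
    implementation.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarities
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = z.shape[0] // 2
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Cross-entropy: pull each embedding toward its augmented sibling,
    # push it away from every other embedding in the batch.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return np.mean(logsumexp - sim[np.arange(2 * n), positives])
```

<p>Minimizing this loss pulls the two views of each image together while pushing all other pairs in the batch apart.</p>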
<p><strong>Masked language modeling and autoregressive language modeling for text:</strong>
Masked Language Modeling (MLM) and Auto-regressive Language Modeling (ALM) are two self-supervised pre-training approaches.
While the former asks models to predict deliberately masked out tokens from a piece of text, the latter asks models to simply predict the next token in a sequence.
With large amounts of unlabeled text data, large and expressive Transformer models [<a href="https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html">24</a>], and lots of compute, both approaches produce powerful models that are good starting points for downstream fine-tuning.
For instance, Bidirectional Encoder Representations from Transformers (BERT, [<a href="https://arxiv.org/abs/1810.04805">8</a>]), produced state-of-the-art (non-private) results (at the time) for a large collection of language understanding tasks when fine-tuned on each.</p>
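<p>Concretely, the autoregressive objective reduces to a per-token cross-entropy. The sketch below (with hypothetical names and shapes) scores every next token given its prefix; a masked-LM loss is analogous, except that only the masked positions are scored:</p>

```python
import numpy as np

def autoregressive_nll(logits, tokens):
    """Sketch of the autoregressive LM objective: average negative
    log-likelihood of each token given the tokens before it.

    logits: (T, V) model scores for the next token at each position.
    tokens: length-(T+1) token-id sequence; logits[t] scores tokens[t+1].
    """
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    targets = tokens[1:]
    return -log_probs[np.arange(len(targets)), targets].mean()
```

<p>For uniform (all-zero) logits over a vocabulary of size \(V\), this loss equals \(\log V\), the entropy of random guessing.</p>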
<h2 id="fine-tuning-self-supervised-models-with-dp-optimization">Fine-Tuning Self-Supervised Models With DP-Optimization</h2>
<p>Self-supervised pre-training is appealing in the context of differentially private machine learning.
This is because (i) the mildly curated data needed for pre-training can usually be obtained cheaply from the public domain, and (ii) pre-trained models may contain useful domain knowledge that can reduce the sample complexity of subsequent private learning.
A paradigm for private learning that leverages self-supervised pre-training could follow two steps:</p>
<ul>
<li>collect cheap and public (unlabeled) data from the task domain (e.g., vision, language, etc.) to pre-train a model with self-supervised learning, and</li>
<li>collect moderate amounts of task-specific private (labeled) data and fine-tune the pre-trained model under differential privacy to perform the task.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></li>
</ul>
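<p>For the second step, the core of DP-SGD is per-example gradient clipping followed by Gaussian noise. A minimal sketch (the function name and default hyper-parameters are illustrative, not taken from any of the cited implementations):</p>

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_mult=1.0):
    """One DP-SGD update (sketch): clip each example's gradient to clip_norm,
    sum, add Gaussian noise calibrated to the clipping norm, then average.

    per_example_grads: (batch, dim) array of per-example gradients.
    """
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    batch = per_example_grads.shape[0]
    noise = np.random.normal(0.0, noise_mult * clip_norm, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / batch
    return params - lr * noisy_mean
```

<p>The clipping norm bounds each individual's influence on the update, and the noise multiplier (together with the sampling rate and number of steps) determines the privacy budget via a composition analysis.</p>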
<p>To date, some of the best differentially private deep learning results in the literature have resulted from instantiating this paradigm [<a href="https://arxiv.org/abs/2011.11660">4</a>, <a href="https://arxiv.org/abs/2110.05679">11</a>, <a href="https://arxiv.org/abs/2110.06500">12</a>].
Below, we review works which capitalize on self-supervised pre-training by differentially privately fine-tuning pre-trained models with an iterative gradient method like DP-SGD [<a href="https://dl.acm.org/doi/abs/10.1145/2976749.2978318">19</a>, <a href="https://ieeexplore.ieee.org/abstract/document/6736861">20</a>].<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>
<img src="/images/fine-tuning-paradigm.png" alt="" /></p>
<p><strong>Private fine-tuning with SimCLR features:</strong>
The authors of [<a href="https://arxiv.org/abs/2011.11660">4</a>] fine-tuned a linear model on top of the embedding vectors produced by SimCLRv2 from the CIFAR-10 dataset. Under a privacy budget of \(\epsilon=2\),
these models reached an average test accuracy of 92.7%. This number can be further improved to ~94% with the use of larger and wider pre-trained models in the SimCLRv2 family.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>
These test accuracies are very close to some standard non-private results attained by an off-the-shelf ResNet architecture [<a href="https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html">6</a>].</p>
<p><strong>Privately fine-tuning BERT variants and GPT-2:</strong>
The authors of [<a href="https://arxiv.org/abs/2110.05679">11</a>, <a href="https://arxiv.org/abs/2110.06500">12</a>, <a href="http://proceedings.mlr.press/v139/yu21f.html">16</a>] showed that with appropriate hyper-parameters, fine-tuning BERT variants and GPT-2 with DP-optimization results in high-performing private models for text classification and language generation — even on datasets of modest sizes and under modest privacy budgets.
Notably, some of these models attain a task performance close to non-private models from previous years in the literature.
These results also exceed many non-private learning results from the pre-BERT and pre-GPT years.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></p>
<p>More interestingly, the authors showed that the larger (and thus better) the pre-trained model, the better the private fine-tuning performance gets.
This empirical observation in private fine-tuning of large Transformers is qualitatively different from what’s implied by the usual minimax optimal rates derived for vanilla private learning with convex loss functions under approximate differential privacy [<a href="https://ieeexplore.ieee.org/abstract/document/6979031">14</a>, <a href="https://proceedings.neurips.cc/paper/2019/hash/3bd8fdb090f1f5eb66a00c84dbc5ad51-Abstract.html">15</a>].
This discrepancy between experimental results for training large models and the theory for learning with convex losses suggests there is more to be understood.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></p>
<p>Overall, for both vision and language tasks, private learning performance has consistently improved with the improvement in the quality of pre-training,
where the latter is measured by the non-private fine-tuning performance.<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup></p>
<p float="left">
<img src="../images/figure1_classification.png" width="48%" />
<img src="../images/figure1_generation.png" width="48%" />
<caption>Figure 1: Privately fine-tuning better (and larger) pre-trained models leads to consistently improving performance for text classification and language generation.
Left: text classification on MNLI [<a href="https://arxiv.org/abs/1704.05426">25</a>]. Right: language generation on E2E [<a href="https://arxiv.org/abs/1706.09254">26</a>].</caption>
</p>
<h2 id="conclusion-and-outlook">Conclusion and Outlook</h2>
<p>We surveyed recent works in the literature that obtained highly performant private machine learning models by leveraging self-supervised pre-training.
Common to these results is the trend that the performance of private learning consistently improved with the quality of public pre-training.
We therefore anticipate that the general paradigm may be useful in additional settings (e.g., federated learning) and tasks (e.g., private synthetic image generation), and lead to better private learning results.</p>
<p>We have thus far assumed that the data for public pre-training can be cheaply obtained.
This, however, does not imply that determining whether a particular source of data is appropriate for public pre-training is an easy problem.
Using publicly available data is not necessarily risk-free in terms of privacy.
For instance, the authors of [<a href="https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting">33</a>] were able to extract personally identifiable information from a GPT-2 model pre-trained on data scraped from the public internet.</p>
<p>Self-supervised pre-training has led to progress in private deep learning, but leveraging pre-trained models alone will not address several fundamental challenges to differentially private learning.
First and foremost, the datasets of machine learning tasks may be sampled from long-tailed distributions [<a href="https://proceedings.neurips.cc/paper/2020/hash/1e14bfe2714193e7af5abc64ecbd6b46-Abstract.html">21</a>].
When privately trained on such datasets, a machine learning model may fail to acquire the learning signal necessary to perform accurate predictions for examples on the tail [<a href="https://dl.acm.org/doi/abs/10.1145/3442188.3445934">28</a>] or from underrepresented (sub)populations [<a href="https://proceedings.neurips.cc/paper/2019/hash/fc0de4e0396fff257ea362983c2dda5a-Abstract.html">29</a>].
Second, many machine learning problems are in a domain where public data (even unlabeled data) may be sparse, e.g., medical imaging.
Developing refined versions of the pre-train-fine-tune approach for problems from these domains is an interesting avenue for future work.</p>
<p>Lastly, differential privacy as one specific definition of privacy may not capture all that’s desired for privacy in reality.
For instance, while differentially private algorithms naturally give machine unlearning guarantees [<a href="https://ieeexplore.ieee.org/abstract/document/9519428">30</a>, <a href="https://ieeexplore.ieee.org/abstract/document/7163042">32</a>], tailored unlearning algorithms tend to have higher capacities of unlearning [<a href="https://proceedings.neurips.cc/paper/2021/hash/9627c45df543c816a3ddf2d8ea686a99-Abstract.html">31</a>].
In addition, what constitutes a record in the differential privacy framework can oftentimes be unclear.
Inappropriately defined example boundaries can create correlated records which cause differential privacy guarantees to degrade [<a href="https://arxiv.org/abs/1603.01508">22</a>].
Moreover, differential privacy guarantees won’t directly prevent the inference of private data outside the original context [<a href="https://heinonline.org/hol-cgi-bin/get_pdf.cgi?handle=hein.journals/washlr79&section=16">23</a>].
These are fundamental limitations of differential privacy that improvements to differentially private learning won’t address.</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>The authors thank Nicolas Papernot and Gautam Kamath for detailed feedback and edit suggestions.</p>
<hr />
<h2 id="references">References</h2>
<p>[1] Rahman MA, Rahman T, Laganière R, Mohammed N, Wang Y. Membership Inference Attack against Differentially Private Deep Learning Model. Trans. Data Priv.. 2018 Apr 1;11(1):61-79.</p>
<p>[2] Carlini N, Liu C, Erlingsson Ú, Kos J, Song D. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19) 2019 (pp. 267-284).</p>
<p>[3] Guo C, Karrer B, Chaudhuri K, van der Maaten L. Bounding Training Data Reconstruction in Private (Deep) Learning. arXiv preprint arXiv:2201.12383. 2022 Jan 28.</p>
<p>[4] Tramer F, Boneh D. Differentially private learning needs better features (or much more data). arXiv preprint arXiv:2011.11660. 2020 Nov 23.</p>
<p>[5] Yu D, Zhang H, Chen W, Liu TY. Do not let privacy overbill utility: Gradient embedding perturbation for private learning. arXiv preprint arXiv:2102.12677. 2021 Feb 25.</p>
<p>[6] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition 2016 (pp. 770-778).</p>
<p>[7] Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training.</p>
<p>[8] Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018 Oct 11.</p>
<p>[9] Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning 2020 Nov 21 (pp. 1597-1607). PMLR.</p>
<p>[10] Li XL, Liang P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190. 2021 Jan 1.</p>
<p>[11] Li X, Tramer F, Liang P, Hashimoto T. Large language models can be strong differentially private learners. arXiv preprint arXiv:2110.05679. 2021 Oct 12.</p>
<p>[12] Yu D, Naik S, Backurs A, Gopi S, Inan HA, Kamath G, Kulkarni J, Lee YT, Manoel A, Wutschitz L, Yekhanin S. Differentially private fine-tuning of language models. arXiv preprint arXiv:2110.06500. 2021 Oct 13.</p>
<p>[13] Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. 2019 Jul 26.</p>
<p>[14] Bassily R, Smith A, Thakurta A. Private empirical risk minimization: Efficient algorithms and tight error bounds. In2014 IEEE 55th Annual Symposium on Foundations of Computer Science 2014 Oct 18 (pp. 464-473). IEEE.</p>
<p>[15] Bassily R, Feldman V, Talwar K, Guha Thakurta A. Private stochastic convex optimization with optimal rates. Advances in Neural Information Processing Systems. 2019;32.</p>
<p>[16] Yu D, Zhang H, Chen W, Yin J, Liu TY. Large scale private learning via low-rank reparametrization. InInternational Conference on Machine Learning 2021 Jul 1 (pp. 12208-12218). PMLR.</p>
<p>[17] Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI blog. 2019 Feb 24;1(8):9.</p>
<p>[18] Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, Bernstein MS, Bohg J, Bosselut A, Brunskill E, Brynjolfsson E, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. 2021 Aug 16.</p>
<p>[19] Abadi M, Chu A, Goodfellow I, McMahan HB, Mironov I, Talwar K, Zhang L. Deep learning with differential privacy. InProceedings of the 2016 ACM SIGSAC conference on computer and communications security 2016 Oct 24 (pp. 308-318).</p>
<p>[20] Song S, Chaudhuri K, Sarwate AD. Stochastic gradient descent with differentially private updates. In2013 IEEE Global Conference on Signal and Information Processing 2013 Dec 3 (pp. 245-248). IEEE.</p>
<p>[21] Feldman V, Zhang C. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems. 2020;33:2881-91.</p>
<p>[22] Ghosh A, Kleinberg R. Inferential privacy guarantees for differentially private mechanisms. arXiv preprint arXiv:1603.01508. 2016 Mar 4.</p>
<p>[23] Nissenbaum H. Privacy as contextual integrity. Wash. L. Rev.. 2004;79:119.</p>
<p>[24] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Advances in neural information processing systems. 2017;30.</p>
<p>[25] Williams A, Nangia N, Bowman SR. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426. 2017 Apr 18.</p>
<p>[26] Novikova J, Dušek O, Rieser V. The E2E dataset: New challenges for end-to-end generation. arXiv preprint arXiv:1706.09254. 2017 Jun 28.</p>
<p>[27] Papernot N, Chien S, Song S, Thakurta A, Erlingsson U. Making the shoe fit: Architectures, initializations, and tuning for learning with privacy.</p>
<p>[28] Suriyakumar VM, Papernot N, Goldenberg A, Ghassemi M. Chasing your long tails: Differentially private prediction in health care settings. InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency 2021 Mar 3 (pp. 723-734).</p>
<p>[29] Bagdasaryan E, Poursaeed O, Shmatikov V. Differential privacy has disparate impact on model accuracy. Advances in Neural Information Processing Systems. 2019;32.</p>
<p>[30] Bourtoule L, Chandrasekaran V, Choquette-Choo CA, Jia H, Travers A, Zhang B, Lie D, Papernot N. Machine unlearning. In2021 IEEE Symposium on Security and Privacy (SP) 2021 May 24 (pp. 141-159). IEEE.</p>
<p>[31] Sekhari A, Acharya J, Kamath G, Suresh AT. Remember what you want to forget: Algorithms for machine unlearning. Advances in Neural Information Processing Systems. 2021 Dec 6;34.</p>
<p>[32] Cao Y, Yang J. Towards making systems forget with machine unlearning. In2015 IEEE Symposium on Security and Privacy 2015 May 17 (pp. 463-480). IEEE.</p>
<p>[33] Carlini N, Tramer F, Wallace E, Jagielski M, Herbert-Voss A, Lee K, Roberts A, Brown T, Song D, Erlingsson U, Oprea A. Extracting training data from large language models. In30th USENIX Security Symposium (USENIX Security 21) 2021 (pp. 2633-2650).</p>
<p>[34] Nasr M, Song S, Thakurta A, Papernot N, Carlini N. Adversary instantiation: Lower bounds for differentially private machine learning. In2021 IEEE Symposium on Security and Privacy (SP) 2021 May 24 (pp. 866-882). IEEE.</p>
<p>[35] Jagielski M, Ullman J, Oprea A. Auditing differentially private machine learning: How private is private sgd?. Advances in Neural Information Processing Systems. 2020;33:22205-16.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Authors of [<a href="https://arxiv.org/abs/2108.07258">18</a>] framed these self-supervised models, trained on broad data at scale and adaptable to a wide range of downstream tasks, as “foundation models.” <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>The idea of privately fine-tuning a publicly pre-trained model certainly isn’t new. One of the first differentially private deep learning papers [<a href="https://arxiv.org/abs/1607.00133">19</a>] considered an experiment which fine-tuned convolutional nets on CIFAR-10 which were pre-trained on CIFAR-100. Results on privately fine-tuning <em>self-supervised</em> models are, on the other hand, more recent. Covering these results is our main focus here. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Blue and pink sphere avatars taken from [<a href="https://arxiv.org/abs/2108.07258">18</a>]. Credit to <a href="https://cs.stanford.edu/~dorarad/">Drew A. Hudson</a> for making these. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Unpublished result. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>Hyper-parameters that work well for non-private learning typically aren’t those that work best for differentially private learning [<a href="https://openreview.net/pdf?id=rJg851rYwH">27</a>]. It’s crucial to use a large batch size, a small clipping norm, an appropriate learning rate, and a reasonably large number of training epochs to obtain the mentioned private learning results [<a href="https://arxiv.org/abs/2110.05679">11</a>]. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>In practice, past works have presented mixed results on whether larger models would yield better performance. While some showed that using more filters in a convolutional network can degrade the performance of private learning after some threshold [<a href="https://openreview.net/pdf?id=rJg851rYwH">27</a>], others showed that a larger model can outperform a smaller model from a different model family [<a href="https://arxiv.org/abs/2011.11660">4</a>]. Note these results are conditioned on their particular hyperparameter choices. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>Since the pre-training data for large language models are oftentimes collected through large scale web scraping (e.g., WebText), a common concern is that some training and test instances for downstream tasks may already appear in the pre-training data. Self-supervised pre-training therefore can give models an opportunity to “see” this data even before they are privately fine-tuned. Authors of [<a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">17</a>] confirmed that there is a 1-6% overlap between the test set of many natural language processing tasks and the pre-training data they collected (WebText); these common tasks, however, don’t include those studied by authors of [<a href="https://arxiv.org/abs/2110.05679">11</a>]. The numbers suggest a possibility that existing private fine-tuning results in the literature could be slightly inflated compared to when the pre-training data didn’t contain any instance for any downstream task for which evaluation was performed. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Xuechen LiFlorian TramèrJanardhan KulkarniTatsunori HashimotoTue, 15 Mar 2022 11:00:00 -0800
https://differentialprivacy.org/dp-fine-tuning/
A simple recipe for private synthetic data generation<p>In the <a href="https://differentialprivacy.org/synth-data-0/">last blog post</a>, we covered the potential pitfalls of synthetic data without formal privacy guarantees, and motivated the need for differentially private synthetic data mechanisms. In this blog post, we will describe the <strong>select-measure-generate</strong> paradigm, which is a simple and effective template for designing synthetic data mechanisms. The three steps underlying the select-measure-generate paradigm are illustrated and explained below.</p>
<p><img src="/images/select-measure-reconstruct.png" alt="" /></p>
<ol>
<li><strong>Select</strong> a collection of queries to measure — typically low-dimensional marginals.</li>
<li><strong>Measure</strong> the selected queries privately using a noise-addition mechanism.</li>
<li><strong>Generate</strong> synthetic data that best explains the noisy measurements.<sup id="fnref:0" role="doc-noteref"><a href="#fn:0" class="footnote" rel="footnote">1</a></sup></li>
</ol>
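<p>As a toy sketch of the first two steps, here is how one might select a single marginal and measure it with the Gaussian mechanism (the names and shapes are ours, not the MBI API):</p>

```python
import numpy as np

def measure_marginal(data, cols, sizes, sigma):
    """'Measure' step sketch: compute one marginal (a histogram over a
    subset of attributes) and add Gaussian noise. Each individual
    contributes to exactly one cell, so the marginal's L2 sensitivity is 1.

    data: (n, d) integer array; cols: attribute indices to select;
    sizes: domain size of each selected attribute; sigma: noise scale.
    """
    idx = np.ravel_multi_index([data[:, c] for c in cols], sizes)
    counts = np.bincount(idx, minlength=int(np.prod(sizes))).astype(float)
    return counts + np.random.normal(0.0, sigma, size=counts.shape)
```

<p>Repeating this for each selected subset of attributes yields the noisy measurements \( y \) that the Generate step consumes.</p>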
<p>Mechanisms in this class differ primarily in their methodology for selecting queries and their algorithm for generating synthetic data from noisy measurements. The focus of this blog post is the final <strong>Generate</strong> step. Specifically, we will explore different ways in which one can model data distributions for the purpose of generating synthetic data, outlining the qualitative pros and cons of each method. We will then introduce the <strong><a href="https://github.com/ryan112358/private-pgm" target="_blank">Marginal-Based Inference (MBI)</a></strong> repository, which provides methods that, given a set of noisy measurements, enable users to generate synthetic data in a generic and scalable way.</p>
<p>Separating the Generate subroutine from the existing synthetic data generation mechanisms greatly simplifies the design space of new differentially private mechanisms. It allows the mechanism designer to focus on <em>selecting the queries</em> to maximize utility of the synthetic data, rather than <em>how to generate synthetic data</em> that explain the noisy measurements well. Both are challenging technical problems that require different techniques to solve, and MBI provides principled solutions to the latter problem, while exposing an interface that can be readily adopted by mechanism designers.</p>
<h1 id="the-generate-subproblem-a-unifying-view">The Generate Subproblem: A Unifying View</h1>
<p>In this section we will introduce the main optimization problem that underlies several methods for the Generate subproblem, and provide a high-level overview of how each method attempts to solve this optimization problem. Let \( y = \mathcal{M}(D) \) be the noisy measurements obtained from running a privacy mechanism on a discrete dataset \( D \). Our goal is to post-process these noisy measurements to obtain synthetic data that explains them well. In particular, we wish to search over the space of all <em>datasets</em> for one that maximizes the likelihood of the observations \( y \).<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup></p>
<p>\[ \hat{D} \in \text{arg} \max_{D \in \mathcal{D}} \log \mathbb{P}[\mathcal{M}(D) = y] \]</p>
<p>This is a high-dimensional discrete optimization problem, and is generally intractable to solve in practice, even in low-dimensional settings. It is common to consider the relaxed problem that instead optimizes over the set of <em>probability distributions</em> \( \mathcal{S} \):</p>
<p>\[ \hat{P} \in \text{arg} \max_{P \in \mathcal{S}} \log \mathbb{P}[\mathcal{M}(P) = y] \label{eq1} \tag{1} \]</p>
<p>More generally, we can consider any objective function that measures how well \( P \) explains \( y \). The log-likelihood is a natural choice, although other choices are also possible and used in practice. In the special-but-common case where the mechanism is an instance of the Gaussian mechanism, we have \( \mathcal{M}(D) = f(D) + \mathcal{N}(0, \sigma^2)^k \) and \( \log \mathbb{P}[\mathcal{M}(P) = y] \propto - || f(P) - y ||_2^2 \). If \( f \) is a linear function of \( P \), then Problem \ref{eq1} is simply a quadratic program. In the subsequent subsections, we will describe different approaches to solve or approximately solve Problem \ref{eq1}.</p>
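<p>In this Gaussian case, evaluating the objective amounts to computing a sum of squared residuals. A minimal numpy sketch, where the query matrix \( F \) and the noise values are illustrative assumptions for the example:</p>

```python
import numpy as np

def neg_log_likelihood(P, F, y):
    """Negative Gaussian log-likelihood, up to constants: ||F @ P - y||_2^2."""
    return float(np.sum((F @ P - y) ** 2))

# Toy instance: a distribution over 4 outcomes, measured by 2 linear queries.
P = np.array([0.25, 0.25, 0.25, 0.25])   # a point on the probability simplex
F = np.array([[1.0, 1.0, 0.0, 0.0],      # query 1: total mass on outcomes {0, 1}
              [0.0, 0.0, 1.0, 1.0]])     # query 2: total mass on outcomes {2, 3}
y = F @ P + np.array([0.01, -0.02])      # noisy answers (noise values hard-coded)

loss = neg_log_likelihood(P, F, y)
```

<p>Minimizing this loss over the simplex is exactly the quadratic program described above.</p>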
<blockquote>
<p><strong>Remark 1</strong>: The distribution learned from solving Problem \ref{eq1} will resemble the true data with respect to the statistics measured by \( \mathcal{M} \). It may or may not accurately preserve other statistics — that is data dependent.</p>
</blockquote>
<blockquote>
<p><strong>Remark 2</strong>: The most common statistics to measure are <strong>low-dimensional marginals</strong>. A marginal for a subset of attributes counts the number of records in the dataset that match each setting of possible values. They are appealing statistics to measure because:</p>
<ul>
<li>They capture low-dimensional structure common in real world data distributions.</li>
<li>Each cell in a marginal is a count, a statistic that is fairly robust to noise.</li>
<li>One individual can only contribute to a single cell of a marginal, so all cells have low sensitivity and can be measured simultaneously with low privacy cost.</li>
</ul>
</blockquote>
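<p>To make the remark concrete, here is a small numpy sketch, on a made-up toy dataset, that computes a 2-way marginal; note that each record increments exactly one cell:</p>

```python
import numpy as np

# Toy dataset: 6 records with two discrete attributes, A in {0,1}, B in {0,1,2}.
records = np.array([[0, 0],
                    [0, 2],
                    [1, 1],
                    [1, 1],
                    [0, 0],
                    [1, 2]])

# The (A, B) marginal: count the records matching each (a, b) setting.
marginal = np.zeros((2, 3), dtype=int)
for a, b in records:
    marginal[a, b] += 1

# Each record increments exactly one cell, so adding or removing one
# individual changes a single cell by 1 (low sensitivity across all cells).
assert marginal.sum() == len(records)
```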
<h3 id="direct">Direct</h3>
<p>We can attempt to solve Problem \ref{eq1} directly by utilizing any algorithm for convex optimization over the probability simplex, such as multiplicative weights. This method works well in low-dimensional regimes, but it quickly breaks down for higher-dimensional domains, where it is generally intractable to even enumerate all the entries of a single distribution \( P \), let alone optimize over the space of all distributions.</p>
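<p>As a sketch of the direct approach, the following hypothetical implementation runs multiplicative weights (exponentiated gradient descent) on the squared-error objective over the probability simplex:</p>

```python
import numpy as np

def multiplicative_weights(F, y, iters=2000, eta=0.1):
    """Exponentiated gradient descent for min_P ||F @ P - y||^2 over the simplex."""
    n = F.shape[1]
    P = np.full(n, 1.0 / n)              # start from the uniform distribution
    for _ in range(iters):
        grad = 2.0 * F.T @ (F @ P - y)   # gradient of the squared-error loss
        P = P * np.exp(-eta * grad)      # multiplicative weights update
        P /= P.sum()                     # renormalize onto the simplex
    return P

# Sanity check: recover a distribution from its own noiseless query answers.
F = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
true_P = np.array([0.5, 0.3, 0.2])
P_hat = multiplicative_weights(F, F @ true_P)
```

<p>Note that \( P \) is represented explicitly, one entry per element of the domain, which is exactly why this approach cannot scale to high-dimensional domains.</p>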
<p>Until recently, variants of the direct method were the only general-purpose solutions available for this problem, and as a result, many mechanisms struggled to scale to high-dimensional domains. Recently, several methods have been proposed that attempt to overcome the curse of dimensionality inherent in the direct approach, which scale by imposing additional assumptions on the mechanism \( \mathcal{M} \) and/or by relaxing the optimization problem. A common theme is to restrict attention to a subset of joint distributions which have tractable representations. The sections below describe these more scalable methods, including the different (implicit) assumptions each method makes, as well as the consequences of those assumptions.</p>
<h3 id="probabilistic-graphical-models-pgm">Probabilistic Graphical Models (PGM)</h3>
<p>The first method we describe is <a href="https://arxiv.org/abs/1901.09136" target="\_blank">PGM</a>, which was a key component of the first-place solution in the 2018 NIST Differential Privacy <a href="https://www.nist.gov/ctl/pscr/open-innovation-prize-challenges/past-prize-challenges/2018-differential-privacy-synthetic" target="\_blank">Synthetic Data Competition</a> and in both the first and second-place solutions in the follow-up <a href="https://www.nist.gov/ctl/pscr/open-innovation-prize-challenges/current-and-upcoming-prize-challenges/2020-differential" target="\_blank">Temporal Map Competition</a>.</p>
<p>PGM scales by restricting attention to distributions that can be represented as a graphical model \( P_{\theta} \). The key observation of PGM is that when \( \mathcal{M} \) only depends on \( P \) through its low-dimensional marginals, then one of the optimizers of Problem \ref{eq1} is a graphical model with parameters \( \theta \). In this case, Problem \ref{eq1} is under-determined and typically has infinitely many solutions. It turns out that the solution found by PGM has maximum entropy among all solutions to the problem — a very natural way to break ties among equally good solutions. Remarkably, these facts are true for any dataset — they do not require the underlying data to be generated from a graphical model with the same structure <a href="https://arxiv.org/abs/2108.04978" target="\_blank">[MMS21]</a>.</p>
<p>The parameter vector \( \theta \) is often much smaller than \( P \), and we can efficiently optimize it, bypassing the curse of dimensionality in this special case. The size of \( \theta \) and in turn the complexity of PGM depends on the mechanism \( \mathcal{M} \), and in the worst case is the same as the Direct method.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">3</a></sup> However, in many common cases of practical interest, the complexity of PGM is exponentially better than that of Direct, in which case we can efficiently solve the optimization problem above, finding \( \theta \) and thus a tractable representation of \( \hat{P} \). The complexity ultimately depends on the size of the junction tree derived from the mechanism \( \mathcal{M} \), and understanding this relationship requires some expertise in graphical models. However, if we utilize this understanding to design \( \mathcal{M} \), we can avoid this worst-case behavior, as <a href="https://arxiv.org/abs/2108.04978" target="\_blank">MST</a> and <a href="http://vldb.org/pvldb/vol14/p2190-cai.pdf" target="\_blank">PrivMRF</a> do.</p>
<h3 id="relaxed-tabular">Relaxed Tabular</h3>
<p>An alternative approach was proposed in the recent <a href="https://arxiv.org/abs/2103.06641" target="\_blank">RAP</a> paper. The key idea is to restrict attention to “pseudo-distributions” that can be represented in a relaxed tabular format. The format is similar to the one-hot encoding of a discrete dataset, although the entries need not be \( 0 \) or \( 1 \), which enables gradient-based optimization to be performed on the cells in this table. The number of rows is a tunable knob that trades off expressive capacity against computational efficiency. With sufficiently many rows, the true minimizer of the original problem can be expressed in this format, but there is no guarantee that gradient-based optimization will converge to it, because this representation introduces non-convexity. Moreover, the search space of this method includes “spurious” distributions, so even the global optimum of the relaxed problem would not necessarily solve the original problem.<sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">4</a></sup> Despite these drawbacks, this method appears to work well in practice.</p>
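<p>A rough numpy sketch of the idea, with a made-up target marginal and plain projected gradient descent standing in for the optimizers used in the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50                                     # number of rows: the tunable knob
dims = (2, 3)                              # two attributes with 2 and 3 categories
B = rng.uniform(size=(m, sum(dims)))       # relaxed one-hot table, entries in [0, 1]

def answers(B):
    """Relaxed (A, B) marginal: cell (a, b) is the mean of products of entries."""
    A_blk, B_blk = B[:, :dims[0]], B[:, dims[0]:]
    return (A_blk[:, :, None] * B_blk[:, None, :]).mean(axis=0)

y = np.array([[0.20, 0.10, 0.10],          # hypothetical noisy targets for the
              [0.30, 0.10, 0.20]])         # six cells of the (A, B) marginal

def loss(B):
    return float(np.sum((answers(B) - y) ** 2))

loss0 = loss(B)
eta = 2.0
for _ in range(500):                       # projected gradient descent
    A_blk, B_blk = B[:, :dims[0]], B[:, dims[0]:]
    resid = answers(B) - y                 # shape (2, 3)
    grad_A = 2.0 * (B_blk @ resid.T) / m   # shape (m, 2)
    grad_B = 2.0 * (A_blk @ resid) / m     # shape (m, 3)
    B = B - eta * np.concatenate([grad_A, grad_B], axis=1)
    B = np.clip(B, 0.0, 1.0)               # stay in the relaxed range
```

<p>The rows of the final table need not correspond to any real distribution over records, which is exactly the “spurious distributions” caveat noted above.</p>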
<h3 id="generative-networks">Generative Networks</h3>
<p>Among the iterative methods introduced by <a href="https://arxiv.org/abs/2106.07153" target="\_blank">[LVW21]</a> is GEM (Generative networks with the exponential mechanism), an approach inspired by generative adversarial networks. They propose representing any dataset as a mixture of product distributions over attributes in the data domain. They implicitly encode such distributions using a generative neural network with a softmax layer. In concrete terms, given some Gaussian noise \( \mathbf{z} \sim \mathcal{N}(0, I) \), their <strong>Generate</strong> step outputs \( f_\theta(\mathbf{z}) \), where \( f \) is some feedforward neural network parametrized by \( \theta \). \( f_\theta(\mathbf{z}) \) represents a collection of marginal distributions for each individual attribute in the domain, which can be used to directly answer any k-way marginal query. Alternatively, one can sample directly from \( f_\theta(\mathbf{z}) \) if the goal is to generate synthetic tabular data.</p>
<p>Note that the size of \( \mathbf{z} \) can be arbitrarily large, meaning that this generative network approach can theoretically be scaled up to capture any distribution \( P \). Moreover, <a href="https://arxiv.org/abs/2106.07153">[LVW21]</a> show that one can achieve strong performance in practical settings even when \( \mathbf{z} \) is small, allowing such generative network approaches to scale in terms of both computation and memory. However, as is commonly found in deep learning methods, this optimization problem is nonconvex.</p>
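<p>A minimal numpy sketch of this style of generator; the architecture and layer sizes here are arbitrary illustrative choices, not those used by GEM:</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
dims = [2, 3, 4]                  # category counts for three attributes
d_z, d_h = 8, 16                  # noise and hidden-layer widths (arbitrary)
W1 = rng.normal(scale=0.1, size=(d_z, d_h))
W2 = rng.normal(scale=0.1, size=(d_h, sum(dims)))

def generate(z):
    """Map Gaussian noise z to one product distribution per row of z."""
    h = np.tanh(z @ W1)                       # feedforward hidden layer
    logits = h @ W2
    out, start = [], 0
    for k in dims:                            # one softmax block per attribute
        out.append(softmax(logits[:, start:start + k]))
        start += k
    return out

z = rng.normal(size=(5, d_z))                 # z ~ N(0, I)
dists = generate(z)                           # 5 product distributions
```

<p>Each row of \( \mathbf{z} \) yields one product distribution; averaging over rows gives the mixture, from which marginal queries can be answered or records sampled.</p>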
<h3 id="local-consistency">Local Consistency</h3>
<p>Finally, <a href="https://arxiv.org/abs/2106.07153" target="\_blank">GUM</a> and <a href="https://arxiv.org/abs/2109.06153" target="\_blank">APPGM</a> do not search over any space of distributions, but instead impose <em>local consistency</em> constraints on the noisy measurements. These methods relax Problem \ref{eq1} to optimize over the space of pseudo-marginals, rather than distributions. The pseudo-marginals are required to be internally consistent, but there is no guarantee that there is a distribution which realizes those pseudo-marginals. As a result, the solution found by these methods need not be feasible in Problem \ref{eq1}. Nevertheless, we can attempt to generate synthetic data using heuristics to translate these locally consistent pseudo-marginals into synthetic tabular data. This approach was used by team DPSyn in both NIST competitions.</p>
<h3 id="summary">Summary</h3>
<p>A qualitative comparison between the discussed methods is given in the table below.<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">5</a></sup></p>
<blockquote>
<p><strong>Remark 3</strong>: Among the alternatives discussed here, only Direct and PGM can be expected to solve Problem \ref{eq1}. The alternatives fail to solve Problem \ref{eq1} in general, either from non-convexity, or from introducing spurious distributions to the search space. This distinguishing feature of PGM comes at a cost: the complexity can be much higher than that of the alternatives, and in the worst case it will not be feasible to run. In such cases, one of the approximations must be used instead.</p>
</blockquote>
<table>
<tbody>
<tr>
<td> </td>
<td><strong>Direct</strong></td>
<td><strong>PGM</strong></td>
<td><strong>Relaxed Tabular</strong></td>
<td><strong>Generative Networks</strong></td>
<td><strong>Local Consistency</strong></td>
</tr>
<tr>
<td>Search space includes optimum</td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:green">Yes</span></td>
</tr>
<tr>
<td>Search space excludes spurious distributions</td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:red">No</span></td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:red">No</span></td>
</tr>
<tr>
<td>Convexity preserving</td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:red">No</span></td>
<td><span style="color:red">No</span></td>
<td><span style="color:green">Yes</span></td>
</tr>
<tr>
<td>Solves Problem \ref{eq1}</td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:green">Yes</span></td>
<td><span style="color:red">No</span></td>
<td><span style="color:red">No</span></td>
<td><span style="color:red">No</span></td>
</tr>
<tr>
<td>Factors influencing scalability</td>
<td><span style="color:red">Size of Entire Domain</span></td>
<td><span style="color:orange">Size of Junction Tree</span></td>
<td><span style="color:green">Size of Largest Marginal</span></td>
<td><span style="color:green">Size of Largest Marginal</span></td>
<td><span style="color:green">Size of Largest Marginal</span></td>
</tr>
</tbody>
</table>
<h1 id="generating-synthetic-data-with-mbi">Generating Synthetic Data with MBI</h1>
<p>Now that we have introduced the techniques underlying the Generate step, we will show how to utilize the implementations in the MBI repository to develop end-to-end mechanisms for differentially private synthetic data.</p>
<h2 id="preparing-noisy-measurements">Preparing Noisy Measurements</h2>
<p>The input to any method for Generate is a collection of noisy measurements. We show below how to prepare these measurements in a format compatible with the methods for Generate implemented in the MBI repository. The measurements are represented as a list, where each element of the list is a noisy marginal (represented as a numpy array), along with relevant metadata including the attributes in the marginal and the amount of noise used to answer it. In the code snippet below, the selected marginals are hard-coded, but in general this list can be modified to tailor the synthetic data towards a different set of marginals.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np
from scipy import sparse
from mbi import Dataset

data = Dataset.load('adult.csv', 'adult-domain.json')

# SELECT the marginals we'd like to measure
marginals = [('marital-status', 'sex'),
             ('education-num', 'race'),
             ('sex', 'hours-per-week'),
             ('workclass',),
             ('marital-status', 'occupation', 'income>50K')]

# MEASURE the marginals and log the noisy answers
sigma = 50
measurements = []
for M in marginals:
    x = data.project(M).datavector()
    y = x + np.random.normal(loc=0, scale=sigma, size=x.shape)
    I = sparse.eye(x.size)
    measurements.append((I, y, sigma, M))</code></pre></figure>
<p>The above code snippet is a 5-fold composition of Gaussian mechanisms with \( \sigma = 50 \), and hence the entire mechanism is \( \frac{5}{2 \sigma^2} = \frac{1}{1000} \)-zCDP.</p>
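<p>This privacy accounting can be sketched in a few lines, together with the standard conversion from \( \rho \)-zCDP to \( (\rho + 2\sqrt{\rho \ln(1/\delta)}, \delta) \)-DP:</p>

```python
import math

def zcdp_of_gaussians(sigma, k=1, sensitivity=1.0):
    """rho for a k-fold composition of Gaussian mechanisms (zCDP adds up)."""
    return k * sensitivity ** 2 / (2 * sigma ** 2)

def zcdp_to_approx_dp(rho, delta):
    """Standard bound: rho-zCDP implies (rho + 2*sqrt(rho*ln(1/delta)), delta)-DP."""
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

rho = zcdp_of_gaussians(sigma=50, k=5)   # the 5 marginals measured above
eps = zcdp_to_approx_dp(rho, delta=1e-6)
```

<p>With \( \sigma = 50 \) and \( k = 5 \) this gives \( \rho = 1/1000 \), matching the calculation above.</p>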
<h2 id="generating-synthetic-data-from-measurements">Generating Synthetic Data from Measurements</h2>
<p>Given measurements represented in the format above, we can readily generate synthetic data using one of several methods. For example, the code snippet below generates synthetic data that approximately matches the noisy measurements:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from mbi import FactoredInference  # PGM
from mbi import MixtureInference   # Relaxed Tabular + Softmax
from mbi import LocalInference     # Local Consistency
from mbi import PublicInference    # Not Discussed

# GENERATE synthetic data using PGM
engine = FactoredInference(data.domain, iters=2500)
model = engine.estimate(measurements)
synth = model.synthetic_data()</code></pre></figure>
<p>To generate synthetic data, we simply instantiate one of the imported inference engines. In the code snippet above, we use the FactoredInference engine, which corresponds to the PGM method. The other inference engines share the same interface, and can be used instead if desired.</p>
<blockquote>
<p><strong>Remark 4</strong>: By utilizing the inference engines implemented in MBI, end-to-end synthetic data mechanisms can be written with remarkably little code. This simple example required less than 25 lines of code, and <a href="https://github.com/ryan112358/private-pgm/tree/master/mechanisms" target="\_blank">more complex mechanisms</a> can usually be written in a single file with less than 200 lines of code. As a result, future research can focus on the measurement selection subproblem, and new ideas can more rapidly be evaluated and iterated on.</p>
</blockquote>
<p>We evaluated the quality of the generated synthetic data by measuring the error on the measured marginals. Interestingly, the synthetic data has lower error than the noisy marginals, with reductions in error of up to 30% for the larger marginals, and around 3% for the smaller ones.</p>
<p><img src="/images/smr1.png" alt="" /></p>
<blockquote>
<p><strong>Remark 5:</strong> It is not surprising that the synthetic data enjoys lower error than the noisy marginals. Problem \ref{eq1} can be seen as a <em>projection problem</em>, and there is substantial theoretical <a href="https://arxiv.org/abs/1212.0297" target="\_blank">[NTZ12]</a> and empirical [<a href="https://dl.acm.org/doi/abs/10.1145/2783258.2783366" target="\_blank">LWK15</a>, <a href="https://systems.cs.columbia.edu/private-systems-class/papers/Abowd2019Census.pdf" target="\_blank">AAGK+19</a>] evidence that solving this problem reduces error. Intuitively, the benefit arises due to the inconsistencies in the noisy observations that are resolved through the optimization procedure.</p>
</blockquote>
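<p>As an aside on how such errors can be computed: one common metric, used here purely as an illustrative assumption, is the total variation (normalized \( L_1 \)) distance between marginal vectors:</p>

```python
import numpy as np

def marginal_error(true_counts, synth_counts):
    """Total variation distance between two normalized marginal vectors."""
    p = true_counts / true_counts.sum()
    q = synth_counts / synth_counts.sum()
    return float(0.5 * np.abs(p - q).sum())

# Hypothetical counts for one marginal, from the true and synthetic data.
true_m  = np.array([40.0, 30.0, 20.0, 10.0])
synth_m = np.array([38.0, 31.0, 22.0, 9.0])
err = marginal_error(true_m, synth_m)
```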
<p>We can also use the synthetic data to estimate marginals we didn’t measure with the Gaussian mechanism. These estimates may or may not be accurate, depending on the data and the marginal being estimated. For example, the error on the (sex, income>50K) marginal is around 0.02, while the error on the (education-num, occupation) marginal is about 0.5.</p>
<p><img src="/images/smr2.png" alt="" /></p>
<blockquote>
<p><strong>Remark 6:</strong> The fact that the synthetic data is not accurate for some marginals is not a limitation of the method used for Generate, but rather an artifact of what marginals were selected. Thus, it is clear that selecting the right marginals to measure plays a crucial role in the quality of the synthetic data. This is an important open problem that will be the topic of a future blog post.</p>
</blockquote>
<h1 id="coming-up-next">Coming up Next</h1>
<p>In this blog post, we focused on the <strong>Generate</strong> step of the select-measure-generate paradigm. For the next blog post in this series, we will focus on state-of-the-art approaches to the <strong>Select</strong> sub-problem. If you have any comments, questions, or remarks, please feel free to share them in the comments section below. If you would like to try generating synthetic data with MBI, check out this <a href="https://colab.research.google.com/drive/1c8gT5m_GWfQoa_mx8eXh4sPD48Y0z3ML?usp=sharing">Jupyter notebook</a> on Google Colab!</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:0" role="doc-endnote">
<p>The Generate step is a post-processing of already privatized noisy marginals, and therefore the privacy analysis only needs to reason about the first two steps. <a href="#fnref:0" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>Here we assume that \( \mathcal{M} \) is a mechanism with a discrete output space. In practice, this is always the case because any mechanism implemented on a finite computer must have a discrete output space. For continuous output spaces, interpret the objective function as a log density rather than a log probability. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>For example, this worst-case behavior is realized if <strong>all</strong> 2-way marginals are measured. While this can be seen as a limitation of PGM, <a href="http://people.seas.harvard.edu/~salil/research/synthetic-Feb2010.pdf">it is known</a> that generating synthetic data that preserves all 2-way marginals is computationally hard in the worst-case. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:9" role="doc-endnote">
<p>This idea was refined into <a href="https://arxiv.org/abs/2106.07153" target="\_blank">RAP<sup>softmax</sup></a> in follow-up work, which overcomes the latter issue, but does not resolve the non-convexity issue. <a href="#fnref:9" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>These approximations were all developed concurrently, and systematic empirical comparisons between them (and PGM) have not been done to date. Some experimental comparisons can be found in <a href="https://arxiv.org/abs/2106.07153" target="\_blank">[LVW21]</a> and <a href="https://arxiv.org/abs/2109.06153">[MPSM21]</a>. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Ryan McKenna, Terrance Liu
Thu, 27 Jan 2022 09:00:00 -0800
https://differentialprivacy.org/synth-data-1/

What is Synthetic Data?
<p>The concept of synthetic data seems to be having “a moment” in the privacy world as a promising approach to sharing data while protecting privacy. Strictly speaking, any time you make up data, you have produced a synthetic dataset, but more specifically</p>
<blockquote>
<p>a <strong>synthetic dataset</strong> is a stand-in for some original dataset that has the same format, and accurately reflects the statistical properties of the original dataset, but contains only “fake” records.</p>
</blockquote>
<p>Intuitively, a synthetic dataset can be used as if it were the real data—we can stare at it, compute summary statistics from it, train models from it, and do everything we normally do with data—but because the records do not correspond to “real” people, we don’t have to worry about protecting privacy.</p>
<p>To be sure, synthetic data is a very appealing concept, but</p>
<blockquote>
<p>simply making data “synthetic” does not guarantee privacy in any meaningful sense of the word,</p>
</blockquote>
<p>and we need to be careful about what it actually means to generate <strong>private synthetic data.</strong></p>
<p>In this post I will briefly describe <strong>differentially private synthetic data</strong>, why we need it, and what we know about producing it. The post will serve as a warmup to some more technical posts that will follow later, describing general approaches and state-of-the-art algorithms for producing synthetic data.</p>
<h3 id="what-is-synthetic-data">What is synthetic data?</h3>
<p>Given a dataset \(X\), a synthetic dataset \(Y\) is a new dataset that has the same <em>structure</em> as \(X\), but whose elements are “fake.” For example, if \(X\) is a highly stylized version of the data collected in the US decennial census, then each element of \(X\) would be a tuple of the form (age, census_block, sex, race, ethnicity) corresponding to the data of a real person who filled out a census card. A synthetic dataset should also contain such tuples, but they no longer need to correspond to real people.</p>
<p>An example of a synthetic dataset for the census would be a dataset consisting of 300,000,000 individuals, all living at one location in rural Mississippi, and all of whom are white, non-hispanic females aged 81. And therein lies the challenge. Of course we want a synthetic dataset not to merely have the same <em>structure</em> as the original data, but also to preserve the <em>statistical properties</em> of the original data. A synthetic dataset for the census should put roughly the right number of individuals of the right types in the right places to reflect the actual population of the US.</p>
<blockquote>
<p>Thus, to generate synthetic data, we have to know something about the original data, which creates opportunities to leak sensitive information.</p>
</blockquote>
<h4 id="synthetic-data-from-generative-modeling">Synthetic data from generative modeling</h4>
<p>Ensuring that a synthetic dataset really only reflects the statistical properties of the real data, and doesn’t encode sensitive information about the individuals in the real dataset, is quite difficult. I could go on forever about strawman proposals that would generate records that appear to be “fake” but really just encode the original dataset in obvious ways. Just so you believe I could do it, suppose for every real record, we get a fake record by adding 10 years to that individual’s age. But what people typically mean when they talk about generating synthetic data, and what the synthetic data products being sold by companies typically do, is train a generative model for the dataset \(X\) and obtain the synthetic dataset \(Y\) by sampling from that model. The hope is that samples from the generative model look like the original dataset in aggregate, but the actual people in the sample have nothing to do with the real people in the original dataset. For example, a generative model trained on the US population will lead to more people living in California than in Montana, but ideally wouldn’t simply spit out the data of the actual residents of California.</p>
<h4 id="generative-models-dont-magically-protect-privacy">Generative models don’t magically protect privacy</h4>
<p>Simply training a generative model on \(X\) doesn’t actually mean we’ve hidden anything about \(X\) or the people in it. For an extreme example, suppose the records in \(X\) are \(x_1,\dots,x_n\) and our generative model is arbitrarily expressive and determines that the best model for the data is just the uniform distribution on the set \(x_1,\dots,x_n\). Then if our synthetic dataset consists of \(n\) iid samples from the generative model, our synthetic dataset will simply contain one or more copies of the real data for a large fraction (about a \(1 - 1/e \approx 63\%\) fraction) of the people in the original dataset.</p>
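<p>A quick simulation makes this fraction concrete: with \( n \) iid samples from the uniform distribution over \( n \) records, the expected fraction of real records appearing at least once is \( 1 - (1 - 1/n)^n \approx 1 - 1/e \approx 63\% \):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Draw n iid samples uniformly from n "records" and count how many distinct
# real records end up copied into the synthetic dataset.
samples = rng.integers(0, n, size=n)
fraction_exposed = np.unique(samples).size / n   # close to 1 - 1/e
```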
<h4 id="reconstruction-attacks-say-hello">Reconstruction attacks say hello</h4>
<p>My example was admittedly a strawman, and when we train a generative model we typically expect it to produce new examples that aren’t in the original training dataset. Maybe by training generative models in the right way, we can ensure that the generative model only captures statistics about the dataset, without revealing information about the individuals in the dataset? Let’s not forget that</p>
<blockquote>
<p>a synthetic dataset is just one of many possible ways to release statistical information about a dataset, and we know from the theory of reconstruction attacks that <strong>any</strong> mechanism that accurately preserves too many statistics inevitably destroys privacy.</p>
</blockquote>
<p>Specifically, there is a rich and fairly comprehensive theory of <strong>reconstruction attacks</strong>, whose <a href="/reconstruction-theory/">theory</a> and <a href="/diffix-attack/">practice</a> we detailed in earlier posts. This theory says that any synthetic dataset that preserves answers to every statistic of the form “How many elements of the dataset satisfy property P?” with any meaningful notion of accuracy, is spectacularly non-private, in the sense that an attacker can reconstruct nearly all of the private information in the original dataset. Viewed through this lens, synthetic data becomes something of a red herring. The important question is which subset of statistics we want to preserve and how to preserve them without compromising privacy.</p>
<h3 id="differentially-private-synthetic-data">Differentially private synthetic data</h3>
<p>Making a dataset “synthetic” may not be a magic bullet, but it’s still a useful goal. Fortunately, we already have a good working definition of what it means for an algorithm to protect the privacy of the individuals in the dataset—<strong>differential privacy</strong>. So perhaps we can design a differentially private algorithm that takes the original dataset \(X\) and outputs a synthetic dataset \(Y\) that preserves many of the properties of \(X\) accurately (but still within the limits imposed by reconstruction attacks)? In fact, the answer turns out to be a resounding yes!</p>
<blockquote>
<p>There is a differentially private algorithm that takes a dataset \(X\) and outputs a synthetic dataset \(Y\) that preserves an <strong>exponential</strong> number of statistical properties of \(X\) <a href="https://arxiv.org/abs/1109.2229">[BLR08]</a>.</p>
</blockquote>
<p>This statement is intentionally quite informal, but intuitively this result says that we can hope to do amazing things, like generate a synthetic dataset that satisfies the strong guarantee of differential privacy while still approximately preserving impressively complex statistics of the dataset, such as the marginal distribution of every set of three attributes, or the prediction error of every linear classifier.</p>
<p>So what’s the problem? Well, primarily it’s computational complexity. This algorithm, and the many beautiful algorithms it inspired, all have exponential worst-case running time. Specifically, if each element of the dataset contains \(d\) bits, then the worst-case running time is at least as large as \(2^d\). Unfortunately, this turns out to be an inherent bottleneck for any algorithm that generates synthetic data.</p>
<blockquote>
<p>Any private algorithm that takes a dataset \(X\) and outputs a synthetic dataset \(Y\) that preserves even just the correlations between each pair of features must have worst-case running time that grows exponentially in the number of features, under widely believed complexity assumptions. <a href="https://eccc.weizmann.ac.il/report/2010/017/">[UV11]</a></p>
</blockquote>
<p>While there are inherent limits on the accuracy of all differentially private algorithms (see <a href="https://privacytools.seas.harvard.edu/publications/exposed-survey-attacks-private-data">[DSSU17]</a>), and there are significant barriers to making differentially private algorithms practical,</p>
<blockquote>
<p>computational complexity is the main barrier that is specific to differentially private algorithms for generating synthetic data.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
</blockquote>
<h4 id="where-do-we-go-from-here">Where do we go from here?</h4>
<p>Despite the computational bottlenecks, there has still been a lot of amazing progress on differentially private synthetic data. And, despite my criticism of generative modeling in isolation as a means of generating synthetic data, most of these approaches are indeed based on <em>differentially private generative models</em> such as sparse graphical models or GANs, and leverage the fact that these types of models can typically be fit efficiently to realistic data, despite their worst-case hardness. In the next few posts, we will try to describe the landscape of the most promising approaches that we currently have available.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>See <a href="https://arxiv.org/abs/2110.13239">[AACGKLSSTZ21]</a> for an interesting example of a <em>statistical</em> limitation that is specific to differentially private algorithms that generate synthetic data. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Jonathan UllmanMon, 10 Jan 2022 15:00:00 -0800
https://differentialprivacy.org/synth-data-0/
Conference Digest - NeurIPS 2021<p>The accepted papers for <a href="https://neurips.cc/Conferences/2021">NeurIPS 2021</a> were recently announced, and there’s a huge amount of differential privacy content.
We found one relevant workshop and 48 papers.
This is up from 31 papers last year, an increase of over 50%!
It looks like there’s huge growth in interest in differentially private machine learning.
Impressively, at the time of this writing, all but five papers are already posted on arXiv!
For the full list of accepted papers, see <a href="https://neurips.cc/Conferences/2021/AcceptedPapersInitial">here</a>.
Please let us know if we missed relevant papers on differential privacy!</p>
<h2 id="workshops">Workshops</h2>
<ul>
<li><a href="https://priml2021.github.io/">Privacy in Machine Learning (PriML) 2021</a></li>
</ul>
<h2 id="papers">Papers</h2>
<ul>
<li>
<p><a href="https://arxiv.org/abs/2103.08721">A Central Limit Theorem for Differentially Private Query Answering</a><br />
Jinshuo Dong, Weijie Su, Linjun Zhang</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2108.02391">Adapting to function difficulty and growth conditions in private optimization</a><br />
Hilal Asi, Daniel Levy, John Duchi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.04378">Adaptive Machine Unlearning</a><br />
Varun Gupta, Christopher Jung, Seth Neel, Aaron Roth, Saeed Sharifi-Malvajerdi, Chris Waites</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2110.13239">An Uncertainty Principle is a Price of Privacy-Preserving Microdata</a><br />
John Abowd, Robert Ashmead, Ryan Cumings-Menon, Simson Garfinkel, Daniel Kifer, Philip Leclerc, William Sexton, Ashley Simpson, Christine Task, Pavel Zhuravlev</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.03408">Antipodes of Label Differential Privacy: PATE and ALIBI</a><br />
Mani Malek Esmaeili, Ilya Mironov, Karthik Prasad, Igor Shilov, Florian Tramer</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.13329">Covariance-Aware Private Mean Estimation Without Private Covariance Estimation</a><br />
Gavin Brown, Marco Gaboardi, Adam Smith, Jonathan Ullman, Lydia Zakynthinou</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.06062">Deep Learning with Label Differential Privacy</a><br />
Badih Ghazi, Noah Golowich, Ravi Kumar, Pasin Manurangsi, Chiyuan Zhang</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.05855">Differential Privacy Dynamics of Langevin Diffusion and Noisy Gradient Descent</a><br />
Rishav Chourasia, Jiayuan Ye, Reza Shokri</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.02674">Differentially Private Empirical Risk Minimization under the Fairness Lens</a><br />
Cuong Tran, My Dinh, Ferdinando Fioretto</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2110.14153">Differentially Private Federated Bayesian Optimization with Distributed Exploration</a><br />
Zhongxiang Dai, Bryan Kian Hsiang Low, Patrick Jaillet</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/1905.03871">Differentially Private Learning with Adaptive Clipping</a><br />
Galen Andrew, Om Thakkar, Swaroop Ramaswamy, Brendan McMahan</p>
</li>
<li>
<p>Differentially Private Model Personalization<br />
Prateek Jain, John Rush, Adam Smith, Shuang Song, Abhradeep Guha Thakurta</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.02900">Differentially Private Multi-Armed Bandits in the Shuffle Model</a><br />
Jay Tenenbaum, Haim Kaplan, Yishay Mansour, Uri Stemmer</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2108.02831">Differentially Private n-gram Extraction</a><br />
Kunho Kim, Sivakanth Gopi, Janardhan Kulkarni, Sergey Yekhanin</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2111.02516">Differential Privacy Over Riemannian Manifolds</a><br />
Matthew Reimherr, Karthik Bharath, Carlos Soto</p>
</li>
<li>
<p>Differentially Private Sampling from Distributions<br />
Sofya Raskhodnikova, Satchit Sivakumar, Adam Smith, Marika Swanberg</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.05585">Differentially Private Stochastic Optimization: New Results in Convex and Non-Convex Settings</a><br />
Raef Bassily, Cristóbal Guzmán, Michael Menart</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2111.01177">Don’t Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence</a><br />
Tianshi Cao, Alex Bie, Arash Vahdat, Sanja Fidler, Karsten Kreis</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2010.09063">Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization</a><br />
Pranav Subramani, Nicholas Vadivelu, Gautam Kamath</p>
</li>
<li>
<p>Exact Privacy Guarantees for Markov Chain Implementations of the Exponential Mechanism with Artificial Atoms<br />
Jeremy Seeman, Matthew Reimherr, Aleksandra Slavković</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.03013">Fast and Memory Efficient Differentially Private-SGD via JL Projections</a><br />
Zhiqi Bu, Sivakanth Gopi, Janardhan Kulkarni, Yin Tat Lee, Hanwen Shen, Uthaipon Tantipongpipat</p>
</li>
<li>
<p>G-PATE: Scalable Differentially Private Data Generator via Private Aggregation of Teacher Discriminators<br />
Yunhui Long, Boxin Wang, Zhuolin Yang, Bhavya Kailkhura, Aston Zhang, Carl Gunter, Bo Li</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.03365">Generalized Linear Bandits with Local Differential Privacy</a><br />
Yuxuan Han, Zhipeng Liang, Yang Wang, Jiheng Zhang</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2008.11193">Individual Privacy Accounting via a Rényi Filter</a><br />
Vitaly Feldman, Tijana Zrnic</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2104.00979">Information-constrained optimization: can adaptive processing of gradients help?</a><br />
Jayadev Acharya, Clement Canonne, Prathamesh Mayekar, Himanshu Tyagi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.00463">Instance-optimal Mean Estimation Under Differential Privacy</a><br />
Ziyue Huang, Yuting Liang, Ke Yi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.07153">Iterative Methods for Private Synthetic Data: Unifying Framework and New Methods</a><br />
Terrance Liu, Giuseppe Vietri, Steven Wu</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.11845">Learning with User-Level Privacy</a><br />
Daniel Levy, Ziteng Sun, Kareem Amin, Satyen Kale, Alex Kulesza, Mehryar Mohri, Ananda Theertha Suresh</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.13513">Littlestone Classes are Privately Online Learnable</a><br />
Noah Golowich, Roi Livni</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2010.07778">Local Differential Privacy for Regret Minimization in Reinforcement Learning</a><br />
Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, Matteo Pirotta</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.03940">Locally differentially private estimation of functionals of discrete distributions</a><br />
Cristina Butucea, Yann Issartel</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2105.10675">Locally private online change point detection</a><br />
Tom Berrett, Yi Yu</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.10870">Multiclass versus Binary Differentially Private PAC Learning</a><br />
Satchit Sivakumar, Mark Bun, Marco Gaboardi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.02848">Numerical Composition of Differential Privacy</a><br />
Sivakanth Gopi, Yin Tat Lee, Lukas Wutschitz</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.11526">On the Sample Complexity of Privately Learning Axis-Aligned Rectangles</a><br />
Menachem Sadigurschi, Uri Stemmer</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.03645">Photonic Differential Privacy with Direct Feedback Alignment</a><br />
Ruben Ohana, Hamlet Medina, Julien Launay, Alessandro Cappelli, Iacopo Poli, Liva Ralaivola, Alain Rakotomamonjy</p>
</li>
<li>
<p>Private and Non-private Uniformity Testing for Ranking Data<br />
Róbert Busa-Fekete, Dimitris Fotakis, Emmanouil Zampetakis</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.07171">Private learning implies quantum stability</a><br />
Yihui Quek, Srinivasan Arunachalam, John A Smolin</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2103.15352">Private Non-smooth ERM and SCO in Subquadratic Steps</a><br />
Janardhan Kulkarni, Yin Tat Lee, Daogao Liu</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.02162">Privately Learning Mixtures of Axis-Aligned Gaussians</a><br />
Ishaq Aden-Ali, Hassan Ashtiani, Christopher Liaw</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2106.00001">Privately Learning Subspaces</a><br />
Vikrant Singhal, Thomas Steinke</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2111.02281">Privately Publishable Per-instance Privacy</a><br />
Rachel Redberg, Yu-Xiang Wang</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2109.06153">Relaxed Marginal Consistency for Differentially Private Query Answering</a><br />
Ryan McKenna, Siddhant Pradhan, Daniel Sheldon, Gerome Miklau</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2103.03279">Remember What You Want to Forget: Algorithms for Machine Unlearning</a><br />
Ayush Sekhari, Jayadev Acharya, Gautam Kamath, Ananda Theertha Suresh</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2107.08763">Renyi Differential Privacy of The Subsampled Shuffle Model In Distributed Learning</a><br />
Antonious Girgis, Deepesh Data, Suhas Diggavi</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2102.09159">Robust and differentially private mean estimation</a><br />
Xiyang Liu, Weihao Kong, Sham Kakade, Sewoong Oh</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2110.04995">The Skellam Mechanism for Differentially Private Federated Learning</a><br />
Naman Agarwal, Peter Kairouz, Ken Liu</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/2110.11208">User-Level Differentially Private Learning via Correlated Sampling</a><br />
Badih Ghazi, Ravi Kumar, Pasin Manurangsi</p>
</li>
</ul>
Gautam KamathTue, 09 Nov 2021 11:00:00 -0400
https://differentialprivacy.org/neurips2021/
How to deploy machine learning with differential privacy?<p>In many applications of machine learning, such as machine learning for medical diagnosis, we would like to have machine learning algorithms that do not memorize sensitive information about the training set, such as the specific medical histories of individual patients. Differential privacy is a notion that allows quantifying the degree of privacy protection provided by an algorithm on the underlying (sensitive) data set it operates on. Through the lens of differential privacy, we can design machine learning algorithms that responsibly train models on private data.</p>
<h2 id="why-do-we-need-private-machine-learning-algorithms">Why do we need private machine learning algorithms?</h2>
<p>Machine learning algorithms work by studying a lot of data and updating their parameters to encode the relationships in that data. Ideally, we would like the parameters of these machine learning models to encode general patterns (e.g., ‘‘patients who smoke are more likely to have heart disease’’) rather than facts about specific training examples (e.g., “Jane Smith has heart disease”). Unfortunately, machine learning algorithms do not learn to ignore these specifics by default. If we want to use machine learning to solve an important task, like making a cancer diagnosis model, then when we publish that machine learning model (for example, by making an open source cancer diagnosis model for doctors all over the world to use) we might also inadvertently reveal information about the training set. A malicious attacker might be able to inspect the published model’s predictions and learn private information about Jane Smith. For instance, the adversary could mount a membership inference attack to know whether or not Jane Smith contributed her data to the model’s training set [SSS17]. The adversary could also build on membership inference attacks to extract training data by repeatedly guessing possible training points until they result in a sufficiently strong membership signal from the model’s prediction [CTW20]. In many instances, the model itself may be represented by a few of the data samples (e.g., Support Vector Machine in its dual form).</p>
<p>A common misconception is that if a model generalizes (i.e., performs well on the test examples), then it preserves privacy. As mentioned earlier, this is far from being true. One of the main reasons being that generalization is an average case behavior of a model (over the distribution of data samples), whereas privacy must be provided for everyone, including outliers (which may deviate from our distributional assumptions).</p>
<p>Over the years, researchers have proposed various approaches towards protecting privacy in learning algorithms (k-anonymity [SS98], l-diversity [MKG07], m-invariance [XT07], t-closeness [LLV07], etc.). Unfortunately, all these approaches are vulnerable to what are called composition attacks, which use auxiliary information to violate the privacy protection [GKS08]. Famously, this strategy allowed researchers to de-anonymize part of a movie ratings dataset released to participants of the Netflix Prize when the individuals had also shared their movie ratings publicly on the Internet Movie Database (IMDb) [NS08]. If Jane Smith had assigned the same ratings to movies A, B and C in the Netflix Prize dataset and publicly on IMDb at similar times, then researchers could link data corresponding to Jane across both datasets. This would in turn give them the means to recover ratings that were included in the Netflix Prize but not on IMDb. This example shows how difficult it is to define and guarantee privacy, because it is hard to estimate the scope of knowledge about individuals that is available to adversaries. While the dataset released by Netflix has since been taken down, it is difficult to ensure that all of its copies have been deleted. In recent years, instance-encoding-based methods like InstaHide [HSL20] and NeuraCrypt [YEO21] have been demonstrated to be vulnerable to such composition attacks as well.</p>
<p>As a result, the research community has converged on differential privacy [DMNS06], which provides the following semantic guarantee, as opposed to ad-hoc approaches: An adversary learns almost the same information about an individual whether or not they are present in or absent from the training data set. In particular, it provides a condition on the algorithm, independent from who might be attacking it, or the specifics of the data set instantiation. Put another way, differential privacy is a framework for evaluating the guarantees provided by a system that was designed to protect privacy. Such systems can be applied directly to “raw” data which potentially still contains sensitive information, altogether removing the need for procedures that sanitize or anonymize data and are prone to the failures described previously. That said, minimizing data collection in the first place remains a good practice to limit other forms of privacy risk.</p>
<h2 id="designing-private-machine-learning-algorithms-via-differential-privacy">Designing Private Machine Learning Algorithms via Differential Privacy</h2>
<p>Differential privacy [DMNS06] is a semantic notion of privacy that addresses a lot of the limitations of previous approaches like k-anonymity. The basic idea is to randomize part of the mechanism’s behavior to provide privacy. In our case, the mechanism considered is a learning algorithm, but the differential privacy framework can be applied to study any algorithm.</p>
<p>The intuition for why we introduce randomness into the learning algorithm is that it obscures the contribution of an individual, but does not obscure important statistical patterns. Without randomness, we would be able to ask questions like: “What parameters does the learning algorithm choose when we train it on this specific dataset?” With randomness in the learning algorithm, we instead ask questions like: “What is the probability that the learning algorithm will choose parameters in this set of possible parameters, when we train it on this specific dataset?”</p>
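To make this intuition concrete, here is a minimal sketch of <em>randomized response</em>, the textbook example of a randomized mechanism (it is not the learning algorithm discussed in this post, and the function name is illustrative): each individual reports their true bit only with a carefully chosen probability, which makes any single report deniable.

```python
import math
import random

def randomized_response(true_bit, epsilon):
    """Report `true_bit` with probability e^eps / (1 + e^eps), else flip it.

    The ratio between the probabilities of any output under inputs 0 and 1
    is at most e^eps, which is exactly the epsilon-DP guarantee.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_bit if random.random() < p_truth else 1 - true_bit
```

With a small \(\varepsilon\), the output is close to a fair coin flip, so a single report reveals little about the individual, yet averaging many reports still recovers aggregate statistics.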
<p>We use a version of differential privacy which requires (in our use case of machine learning) that the probability of learning any particular set of parameters stays roughly the same if we change a single data record in the training set. A data record can be a single training example from an individual, or the collection of all the training examples provided by an individual. The former is often referred to as example-level/item-level privacy, and the latter is referred to as user-level differential privacy. While user-level privacy provides stronger semantics, it may be harder to achieve. For a more thorough discussion about the taxonomy of these notions, see [DNPR10, JTT18, HR12, HR13]. In this document, for the ease of exposition of the technical results, we focus on the example-level notion, where changing a single record can mean adding a training example, removing one, or changing the values within one training example. The intuition is that if a single patient (Jane Smith) does not affect the outcome of learning much, then that patient’s records cannot be memorized and her privacy is respected. In the rest of this post, how much a single record can affect the outcome of learning is called the sensitivity of an algorithm.</p>
<p>The guarantee of differential privacy is that an adversary cannot distinguish the answers produced by the randomized algorithm when any single user’s data is removed from the answers returned by the same algorithm on the full data set. We also refer to the degree of indistinguishability as the privacy loss. Smaller privacy loss corresponds to a stronger privacy guarantee.</p>
<p>It is often thought that privacy is a fundamental bottleneck in obtaining good prediction accuracy/generalization for machine learning algorithms. In fact, recent research has shown that in many instances it actually helps in designing algorithms with strong generalization ability. Some of the examples where DP has resulted in designing better learning algorithms are online linear prediction [KV05] and online PCA [DTTZ13]. Notably, [DFH15] formally showed that generalization for any DP learning algorithm comes for free. More concretely, if a DP learning algorithm has good training accuracy, it is guaranteed to have good test accuracy.
This is true because differential privacy itself acts as a very strong form of regularization.</p>
<p>One might argue that the generalization guarantee which a DP algorithm can achieve may be sub-par compared to its non-private baselines. For a large class of learning tasks, one can show that asymptotically DP does not introduce any further error beyond the inherent statistical error [SSTT21]. [ACG16, BFTT19] highlight that, in the presence of enough data, a DP algorithm can get arbitrarily close to the inherent statistical error, even under strong privacy parameters.</p>
<h2 id="private-empirical-risk-minimization">Private Empirical Risk Minimization</h2>
<p>Before we go into the design of specific differentially private learning algorithms, we first formalize the problem setup, and standardize some notation. Consider a training data set \(D={(x_1,y_1),…,(x_n,y_n)}\) drawn i.i.d. from some fixed (unknown) distribution \(\Pi\), with the feature vector being \(x_i\) and label/response being \(y_i\). We define the training loss at any model \(\theta\) as \(L_{train} (\theta, D) = \frac{1}{n} \sum_{i=1}^{n} l(\theta; (x_i, y_i))\), and the corresponding test loss as \(L_{test} (\theta) = E_{(x,y) \sim \Pi} l(\theta;(x,y)) \).
We will design DP algorithms to output models that approximately minimize the test loss while having access only to the training loss.</p>
<p>In the literature, there are a variety of approaches towards designing these DP learning algorithms [CMS11, KST12, BST14, PAE16, BTT18]. One can categorize them broadly as: i) algorithms that assume that the individual loss function \(l(\theta;\cdot) \) is convex in the model parameter to ensure differential privacy, ii) algorithms that are differentially private even when the loss function is non-convex in nature (e.g., deep learning models), and iii) model-agnostic algorithms, which do not require any information about the representation of the model \(\theta\) or the loss function \(l(\theta;\cdot) \). In our current discussion, we will only focus on designing algorithms for (ii) and (iii). This is because it turns out that the best known algorithms for (ii) are already competitive with algorithms that are specific to (i) [INS19].</p>
<h2 id="private-algorithms-for-training-deep-learning-models">Private Algorithms for Training Deep Learning Models</h2>
<p>The first approach, due to [SCS13, BST14, ACG16], is called differentially private stochastic gradient descent (DP-SGD). It proposes to modify the model updates computed by the most common optimizer used in deep learning: stochastic gradient descent (SGD). Typically, stochastic gradient descent trains iteratively. At each iteration, a small number of training examples (a “minibatch”) are sampled from the training set. The optimizer computes the average model error on these examples, and then differentiates this average error with respect to each of the model parameters to obtain a gradient vector. Finally, the model parameters (\(\theta_t\)) are updated by subtracting this gradient (\(\nabla_t\)) multiplied by a small constant \(\eta\) (the learning rate, which controls how quickly the optimizer updates the model’s parameters). At a high level, two modifications are made by DP-SGD to obtain differential privacy: gradients, which are computed on a per-example basis (rather than averaged over multiple examples), are first clipped to control their sensitivity, and, second, spherical Gaussian noise \(b_t\) is added to their sum to obtain the indistinguishability needed for DP. Succinctly, the update step can be written as follows: \(\theta_{t+1} \leftarrow \theta_t - \eta \cdot (\nabla_t + b_t)\).</p>
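As a rough illustration, one such update can be sketched in NumPy as follows. This is a toy sketch with illustrative parameter names, not the TensorFlow Privacy or Opacus implementation, and it omits the privacy accounting those libraries perform:

```python
import numpy as np

def dp_sgd_step(theta, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update on parameters `theta` (a toy sketch).

    Each per-example gradient is clipped to L2 norm at most `clip_norm`
    (bounding its sensitivity); spherical Gaussian noise of standard
    deviation `noise_multiplier * clip_norm` is added to the clipped sum.
    """
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=theta.shape)
    noisy_avg = (np.sum(clipped, axis=0) + noise) / len(per_example_grads)
    return theta - lr * noisy_avg
```

Note the order of operations: clipping happens per example, before summing, so that no single record can move the update by more than `clip_norm`; the noise is then calibrated to that bound.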
<p>Let us take the example of a hospital training a model to predict whether patients will be readmitted after being released. To train the model, the hospital uses information from patient records, such as demographic variables and admission variables (e.g., age, ethnicity, insurance type, type of Intensive Care Unit admitted to) but also time-varying vitals and labs (e.g., heart rate, blood pressure, white blood cell counts) [JPS16]. The modifications made by DP-SGD ensure that if (1) Jane Smith’s individual patient record contained unusual features, e.g., her insurance provider was uncommon for people of her age or her heart rate followed an unusual pattern, the resulting signal will have a bounded impact on our model updates, and (2) the model’s final parameters would be essentially identical should Jane Smith have chosen to not contribute (i.e., opt-out) her patient record to the training set. Stronger differential privacy is achieved when one is able to introduce more noise (i.e., sample noise with larger standard deviation) and train for as few iterations as possible.</p>
<p>Two main components of the above DP-SGD algorithm distinguish it from traditional SGD: i) per-example clipping and ii) Gaussian noise addition. In addition, for the privacy analysis to hold, DP-SGD requires that minibatches are subsampled uniformly at random from the training data set. Uniform subsampling is not a requirement of SGD itself, and in practice many implementations of SGD do not satisfy it, instead iterating over a different permutation of the data at each epoch of training.</p>
<p>While gradient clipping is common in deep learning, often used as a form of regularization, it differs from the clipping in DP-SGD as follows: conventionally, the average gradient over the minibatch is clipped, whereas DP-SGD clips the gradient of each individual example (i.e., the gradient of \(l(\theta_t;(x,y))\)) before averaging. It is an ongoing research direction both to understand the effect of per-example clipping on model training in DP-SGD [SSTT21], and to find effective ways to mitigate its impact in terms of accuracy [PTS21] and training time [ZHS19].</p>
<p>In standard stochastic gradient descent, subsampling is usually used either as a way to speed up the training process [CAR16], or as a form of regularization [RCR15]. In DP-SGD, the randomness in the subsampling of the minibatch is used to guarantee DP. The technical component behind this sort of privacy analysis is called privacy amplification by subsampling [KLNRS08, BBG18]. Since the sampling randomness is used to guarantee DP, it is crucial that the uniformity of the sampling step is of cryptographic strength. Another (possibly) counterintuitive feature of DP-SGD is that, for the best privacy/utility trade-off, it is in general better to have larger batch sizes. In fact, full-batch DP gradient descent may provide the best privacy/utility trade-offs, albeit at the expense of computational feasibility.</p>
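The kind of uniform, cryptographically strong subsampling the analysis assumes can be sketched with a hypothetical helper built on Python’s OS-backed CSPRNG (production DP libraries handle this internally):

```python
import secrets

def poisson_subsample(num_examples, sampling_rate):
    """Return indices of a Poisson-subsampled minibatch (a toy sketch).

    Each example independently joins the minibatch with probability
    `sampling_rate`, drawn from an OS-level cryptographically secure RNG,
    matching the sampling assumption behind privacy amplification.
    """
    rng = secrets.SystemRandom()
    return [i for i in range(num_examples) if rng.random() < sampling_rate]
```

One design consequence is that the minibatch size is itself random (roughly `num_examples * sampling_rate` in expectation), which differs from the fixed-size shuffled batches of standard SGD pipelines.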
<p>For a fixed DP guarantee, the magnitude of the Gaussian noise added to the gradient updates at each step of DP-SGD grows proportionally to the square root of the number of steps the model is trained for. As a result, it is important to tune the number of training steps for the best privacy/utility trade-off.</p>
<p>In the <a href="https://github.com/tensorflow/privacy/blob/master/tutorials/mnist_dpsgd_tutorial.py">following tutorial</a>, we provide a small code snippet to train a model with DP-SGD.</p>
<h2 id="model-agnostic-private-learning">Model Agnostic Private Learning</h2>
<p>The Sample and Aggregate framework [NRS07] is a generic method to add differential privacy to a non-private algorithm without caring about its internal workings, i.e., it is model agnostic. In the context of machine learning, the main idea is as follows: Consider a multi-class classification problem. Take the training data and split it into \(k\) disjoint subsets of equal size. Train independent models \(\theta_1, \theta_2, \ldots, \theta_k \) on the disjoint subsets. To predict on a test example \(x\), first compute a private histogram over the set of \(k\) predictions \(\theta_1(x), \theta_2(x), \ldots, \theta_k(x) \). Then, after adding a small amount of Laplace/Gaussian noise to the counts, select and output the bin with the highest count. In the context of DP learning, this particular approach was used in two different lines of work: i) PATE [PAE16], and ii) model agnostic private learning [BTT18]. While the latter focused on obtaining theoretical privacy/utility trade-offs for a class of learning tasks (e.g., agnostic PAC learning), the PATE approach focuses on practical deployment. Both lines of work make one common observation: if the predictions \(\theta_1(x), \theta_2(x), \ldots, \theta_k(x) \) are fairly consistent, then the privacy cost in terms of DP is very small. Hence, one can run a large number of prediction queries without violating the DP constraints. In the following, we describe the PATE approach in detail.</p>
<p>The private aggregation of teacher ensembles (PATE) demonstrated in particular that this approach allows one to learn deep neural networks with differential privacy. It has an ensemble of models, each trained without privacy, predict with differential privacy by aggregating their predictions rather than revealing them individually. In PATE, we start by partitioning the private dataset into smaller subsets of data. These subsets are partitions, so there is no overlap between the data included in any pair of partitions. If Jane Smith’s record was in our private dataset, then it is included in one of the partitions only. We train an ML model, called a teacher, on each of these partitions; thus, only one of the teachers has analyzed Jane Smith’s record during training. We now have an ensemble of teacher models that were trained independently, but without any guarantees of privacy. How do we use this ensemble to make predictions that respect privacy? In PATE, we add noise while aggregating the predictions made individually by each teacher to form a single common prediction. We count the number of teachers who voted for each class, and then perturb that count by adding random noise sampled from the Laplace or Gaussian distribution. Each label predicted by the noisy aggregation mechanism comes with rigorous differential privacy guarantees that bound the privacy budget spent to label that input. Again, stronger differential privacy is achieved when we are able to introduce more noise in the aggregation and are able to answer as few queries as possible. Let us now come back to our running example. Imagine that we’d like to use the output of PATE to know if Jane likes a particular movie.
The only teacher trained on the partition containing Jane Smith’s data has now learned that a record similar to Jane’s is characteristic of an individual who likes similar movies, and as a consequence it changes its prediction on a test input similar to Jane’s to match the movie rating assigned by Jane. However, because that teacher contributes only a single vote to the aggregation, and the aggregation injects noise, we won’t be able to tell whether the teacher changed its prediction because it indeed trained on Jane’s data or because the noise injected during the aggregation “flipped” that teacher’s vote. The random noise added to vote counts prevents the outcome of the aggregation from reflecting the vote of any individual teacher, which protects privacy.</p>
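The noisy vote aggregation at the heart of this mechanism can be sketched as follows. This is a simplified illustration with made-up names, not the PATE reference implementation; only the winning label is released, and its privacy cost is governed by the Laplace noise scale:

```python
import numpy as np

def noisy_argmax(teacher_votes, num_classes, epsilon, rng):
    """Aggregate teachers' class votes with Laplace noise (a toy sketch).

    Histogram the votes, add independent Laplace noise of scale 1/epsilon
    to each count, and release only the argmax, so no individual teacher's
    vote (hence no single partition's data) is revealed directly.
    """
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(scale=1.0 / epsilon, size=num_classes)
    return int(np.argmax(counts))
```

When the teachers largely agree, the noise almost never changes the winning class, which is why consensus queries are cheap in privacy terms.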
<h2 id="practically-deploying-differential-privacy-in-machine-learning">Practically deploying differential privacy in machine learning</h2>
<p>The two approaches we introduced have the advantage of being conceptually simple to understand. Fortunately, there also exist several open-source implementations of these approaches. For instance, DP-SGD is implemented in TensorFlow Privacy, Objax, and Opacus. This means that one can take an existing TensorFlow, JAX, or PyTorch pipeline for training a machine learning model and replace a non-private optimizer with DP-SGD. An example implementation of PATE is also available in TensorFlow Privacy. So what are the concrete obstacles to deploying machine learning with differential privacy?</p>
<p>The first obstacle is the accuracy of privacy-preserving models. Datasets are often sampled from distributions with heavy tails. For instance, in a medical application, there are typically (and fortunately) fewer patients with a given medical condition than patients without that condition. This means that there are fewer training examples to learn from for patients with each medical condition. Because differential privacy prevents us from learning patterns which are not found generally across the training data, it limits our ability to learn from those patients for whom we have very few examples [SPG]. More generally, there is often a trade-off between the accuracy of a model and the strength of the differential privacy guarantee it was trained with: the smaller the privacy budget is, the larger the impact on accuracy typically is. That said, this tension is not always inevitable, and there are instances where privacy and accuracy are synergistic, because differential privacy implies generalization [DFH15] (but not vice versa).</p>
<p>The second obstacle to deploying differentially private machine learning can be the computational overhead. For instance, in DP-SGD one must compute per-example gradients rather than average gradients. This often means that optimizations implemented in machine learning frameworks to exploit matrix algebra supported by underlying hardware accelerators (e.g., GPUs) are harder to take advantage of. In another example, PATE requires that one train multiple models (the teachers) rather than a single model so this can also introduce overhead in the training procedure. Fortunately, this cost is mostly mitigated in recent implementations of private learning algorithms, in particular in Objax and Opacus.</p>
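<p>To make the per-example computation concrete, here is a minimal NumPy sketch of a single DP-SGD noising step, assuming per-example gradients have already been computed (all names are illustrative; real implementations such as Opacus handle this inside the optimizer):</p>

```python
import numpy as np

def dp_sgd_noisy_mean(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to L2 norm clip_norm, sum,
    add Gaussian noise scaled to the clipping threshold, and average."""
    n, _ = per_example_grads.shape
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Rescale only the gradients whose norm exceeds the threshold.
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=per_example_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / n

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.1, 0.2], [-6.0, 8.0]])  # one row per example
update = dp_sgd_noisy_mean(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
```

<p>With <code>noise_multiplier=0</code> the function reduces to averaging the clipped gradients. The clipping is why any one example can shift the sum by at most <code>clip_norm</code>, which is what the privacy analysis relies on; it is also why the averaged gradient cannot be computed from matrix products alone.</p>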
<p>The third obstacle to deploying differential privacy, in machine learning but more generally in any form of data analysis, is the choice of privacy budget. The smaller the budget, the stronger the guarantee is. This means one can compare two analyses and say which one is “more private”. However, it is unclear what privacy budget is “small enough”. This is particularly problematic given that applications of differential privacy to machine learning often require a privacy budget that provides only weak theoretical guarantees in order to train a model whose accuracy is high enough to warrant a useful deployment. Thus, it may be interesting for practitioners to evaluate the privacy of their machine learning algorithm by attacking it themselves. Whereas the theoretical analysis of an algorithm’s differential privacy guarantees provides a worst-case guarantee limiting how much private information the algorithm can leak against any adversary, implementing a specific attack can be useful to measure how successful a particular adversary or class of adversaries would be. This helps interpret the theoretical guarantee but should not be treated as a direct substitute for it. Open-source implementations of such attacks are increasingly available: e.g., for membership inference <a href="https://github.com/tensorflow/privacy/tree/master/tensorflow_privacy/privacy/privacy_tests/membership_inference_attack">here</a> and <a href="https://github.com/cchoquette/membership-inference">here</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In the above, we discussed some of the algorithmic approaches towards differentially private model training which have been effective in both theoretical and practical settings. Since this is a rapidly growing field, we could not cover all the important aspects of the research space. Some prominent ones include: (i) Choice of the best hyperparameters in the training of DP models: in order to ensure that the overall algorithm preserves differential privacy, one needs to ensure that the choice of hyperparameters itself preserves DP. Recent research has provided algorithms for selecting the best hyperparameters in a differentially private fashion [LT19]. (ii) Choice of network architecture: it is not always true that the best known model architectures for non-private model training are also the best for training with differential privacy. In particular, we know that the number of model parameters may have adverse effects on the privacy/utility trade-off [BST14]. Hence, choosing the right model architecture is important for providing a good privacy/utility trade-off [PTS21]. (iii) Training in the federated/distributed setting: in the above exposition, we assumed that the training data lies in a single centralized location. However, in settings like Federated Learning (FL) [MMRHA17], the data records can be highly distributed, e.g., across various mobile devices. Running DP-SGD in the FL setting, which is required for FL to provide privacy guarantees for the training data, raises a series of challenges [KMA19] which are often addressed by distributed private learning algorithms designed specifically for FL settings [BKMTT20, KMSTTZ21]. Some of the specific challenges in the context of FL include limited and non-uniform availability of clients (holding individual data records) and the unknown (and variable) size of the training data [BKMTT18]. 
On the other hand, PATE-style algorithms lend themselves naturally to the distributed setting once combined with existing cryptographic primitives, as demonstrated by the CaPC protocol [CDD21]. Addressing these challenges is an active area of research.</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>The authors would like to thank Thomas Steinke and Andreas Terzis for detailed feedback and edit suggestions. Parts of this blog post previously appeared on <a href="www.cleverhans.io">www.cleverhans.io</a>.</p>
<h2 id="citations">Citations</h2>
<p>[ACG16] Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016, October). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (pp. 308-318). ACM.</p>
<p>[BBG18] Balle, B., Barthe, G., & Gaboardi, M. (2018). Privacy amplification by subsampling: Tight analyses via couplings and divergences. arXiv preprint arXiv:1807.01647.</p>
<p>[BKMTT18] Balle, B., Kairouz P., McMahan M., Thakkar O. & Thakurta A. (2020). Privacy amplification via random check-ins. In NeurIPS.</p>
<p>[MMRHA17] McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017, April). Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics (pp. 1273-1282). PMLR.</p>
<p>[KMSTTZ21] Kairouz P., McMahan M., Song S., Thakkar O., Thakurta A., & Xu Z. (2021). Practical and Private (Deep) Learning without Sampling or Shuffling. In ICML.</p>
<p>[BFTT19] Bassily, R., Feldman, V., Talwar, K., & Thakurta, A. Private Stochastic Convex Optimization with Optimal Rates. In NeurIPS 2019.</p>
<p>[BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proceedings of the 55th Annual IEEE Symposium on Foundations of Computer Science.</p>
<p>[BTT18] Bassily, R., Thakurta, A. G., & Thakkar, O. D. (2018). Model-agnostic private learning. Advances in Neural Information Processing Systems.</p>
<p>[CDD21] Choquette-Choo, C. A., Dullerud, N., Dziedzic, A., Zhang, Y., Jha, S., Papernot, N., & Wang, X. (2021). CaPC Learning: Confidential and Private Collaborative Learning. arXiv preprint arXiv:2102.05188.</p>
<p>[CMS11] Chaudhuri, K., Monteleoni, C., & Sarwate, A. D. (2011). Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3).</p>
<p>[CTW20] Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., … & Raffel, C. (2020). Extracting training data from large language models. arXiv preprint arXiv:2012.07805.</p>
<p>[DFH15] Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. (2015). Generalization in adaptive data analysis and holdout reuse. arXiv preprint arXiv:1506.02629.</p>
<p>[DMNS06] Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006, March). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (pp. 265-284). Springer, Berlin, Heidelberg.</p>
<p>[DNPR10] Dwork, C., Naor, M., Pitassi, T., & Rothblum, G. N. (2010, June). Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing (pp. 715-724).</p>
<p>[DTTZ14] Dwork, C., Talwar, K., Thakurta, A., & Zhang, L. (2014, May). Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing (pp. 11-20).</p>
<p>[HSL20] Huang, Y., Song, Z., Li, K., & Arora, S. (2020, November). Instahide: Instance-hiding schemes for private distributed learning. In International Conference on Machine Learning (pp. 4507-4518). PMLR.</p>
<p>[HR12] Hardt, M., & Roth, A. (2012, May). Beating randomized response on incoherent matrices. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing (pp. 1255-1268).</p>
<p>[HR13] Hardt, M., & Roth, A. (2013, June). Beyond worst-case analysis in private singular vector computation. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing (pp. 331-340).</p>
<p>[JPS16] Johnson, A., Pollard, T., Shen, L. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016). https://doi.org/10.1038/sdata.2016.35</p>
<p>[JTT18] Jain, P., Thakkar, O. D., & Thakurta, A. (2018, July). Differentially private matrix completion revisited. In International Conference on Machine Learning (pp. 2215-2224). PMLR.</p>
<p>[INS19] Iyengar, R., Near, J. P., Song, D., Thakkar, O., Thakurta, A., & Wang, L. (2019, May). Towards practical differentially private convex optimization. In 2019 IEEE Symposium on Security and Privacy (SP) (pp. 299-316). IEEE.</p>
<p>[KST12] Kifer, D., Smith, A., & Thakurta, A. (2012, June). Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory (pp. 25-1). JMLR Workshop and Conference Proceedings.</p>
<p>[KMA19] Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., … & Zhao, S. (2019). Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977.</p>
<p>[KV05] Kalai, Adam, and Santosh Vempala. “Efficient algorithms for online decision problems.” Journal of Computer and System Sciences 71.3 (2005): 291-307.</p>
<p>[KLNRS08] Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., & Smith, A. (2008). What can we learn privately? In Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science (pp. 531-540).</p>
<p>[LLV07] Li, N., Li, T., & Venkatasubramanian, S. (2007, April). t-closeness: Privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering (pp. 106-115). IEEE.</p>
<p>[LT19] Liu, J., & Talwar, K. (2019, June). Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (pp. 298-309).</p>
<p>[M17] Mironov, I. (2017, August). Renyi differential privacy. In Computer Security Foundations Symposium (CSF), 2017 IEEE 30th (pp. 263-275). IEEE.</p>
<p>[MKG07] Machanavajjhala, Ashwin; Kifer, Daniel; Gehrke, Johannes; Venkitasubramaniam, Muthuramakrishnan (March 2007). “L-diversity: Privacy Beyond K-anonymity”. ACM Transactions on Knowledge Discovery from Data.</p>
<p>[NRS07] Nissim, K., Raskhodnikova, S., & Smith, A. (2007, June). Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing (pp. 75-84).</p>
<p>[NS08] Narayanan, A., & Shmatikov, V. (2008, May). Robust de-anonymization of large sparse datasets. In Security and Privacy, 2008. SP 2008. IEEE Symposium on (pp. 111-125). IEEE.</p>
<p>[PAE16] Papernot, N., Abadi, M., Erlingsson, U., Goodfellow, I., & Talwar, K. (2016). Semi-supervised knowledge transfer for deep learning from private training data. ICLR 2017.</p>
<p>[PTS21] Papernot, N., Thakurta, A., Song, S., Chien, S., & Erlingsson, U. (2020). Tempered sigmoid activations for deep learning with differential privacy. AAAI 2021.</p>
<p>[RCR15] Rudi, A., Camoriano, R., & Rosasco, L. (2015, December). Less is More: Nyström Computational Regularization. In NIPS (pp. 1657-1665).</p>
<p>[SCS13] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In Proceedings of the 2013 IEEE Global Conference on Signal and Information Processing, GlobalSIP ’13, pages 245–248, Washington, DC, USA, 2013. IEEE Computer Society.</p>
<p>[SPG] Chasing Your Long Tails: Differentially Private Prediction in Health Care Settings. Vinith Suriyakumar, Nicolas Papernot, Anna Goldenberg, Marzyeh Ghassemi. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency.</p>
<p>[SS98] Samarati, Pierangela; Sweeney, Latanya (1998). “Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression” (PDF). Harvard Data Privacy Lab. Retrieved April 12, 2017</p>
<p>[SSS17] Shokri, R., Stronati, M., Song, C., & Shmatikov, V. (2017, May). Membership inference attacks against machine learning models. In Security and Privacy (SP), 2017 IEEE Symposium on (pp. 3-18). IEEE.</p>
<p>[SSTT21] Song, S., Thakkar, O., & Thakurta, A. (2020). Evading the Curse of Dimensionality in Unconstrained Private GLMs. In AISTATS 2021.</p>
<p>[XT07] Xiao X, Tao Y (2007) M-invariance: towards privacy preserving re-publication of dynamic datasets. In: SIGMOD conference, Beijing, China, pp 689–700</p>
<p>[YEO21] Yala, A., Esfahanizadeh, H., Oliveira, R. G. D., Duffy, K. R., Ghobadi, M., Jaakkola, T. S., … & Medard, M. (2021). NeuraCrypt: Hiding Private Health Data via Random Neural Networks for Public Training. arXiv preprint arXiv:2106.02484.</p>
<p>[ZHS19] Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2019.</p>
Nicolas PapernotAbhradeep ThakurtaMon, 25 Oct 2021 11:00:00 -0700
https://differentialprivacy.org/how-to-deploy-ml-with-dp/
https://differentialprivacy.org/how-to-deploy-ml-with-dp/One-shot DP Top-k mechanisms<p>In the last <a href="https://differentialprivacy.org/exponential-mechanism-bounded-range/"><em>blog post</em></a>, we showed that the exponential mechanism enjoys improved composition bounds over general pure DP mechanisms due to a property called <strong>bounded range</strong>. For this post, we will present another useful, and somewhat surprising, property of the exponential mechanism in its application of top-\(k\) selection.</p>
<h2 id="differentially-private-top-k-selection">Differentially Private Top-\(k\) Selection</h2>
<p>We will focus on datasets that are a vector of counts \(h = (h_1, \cdots, h_d) \in \mathbb{N}^d\), which consist of counts \(h_i\) for elements from a universe \(\mathcal{U}\) where \(|\mathcal{U}| = d\). Let’s assume that a user’s data can modify each count by at most 1, yet can change all \(d\) counts, i.e. the \(\ell_\infty\)-sensitivity is 1 and the \(\ell_0\)-sensitivity is \(d\). The task here is to return the top-\(k\) elements from the input counts in a differentially private way.</p>
<p>For top-\(1\), this is simply returning the element with the max count, and this is precisely the problem that the exponential mechanism is set up to solve. Let’s write out the exponential mechanism \(M^{(1)}: \mathbb{N}^d \to [d]\) for this instance:
\[
\mathbb{P}[M^{(1)}(h) = i] = \frac{e^{ \varepsilon h_i }}{\sum_{j \in [d] } e^{ \varepsilon h_j } }, \qquad \forall i \in [d].
\]
For those wondering why this formula omits the factor of \(1/2\) in the exponent, we are using the <a href="https://dongjs.github.io/2020/02/10/ExpMech.html">stronger result</a> of the exponential mechanism which replaces global sensitivity with the range of the loss function \(\ell(i,h) = - h_i\), which is \(1\) in this case. Recall from the last blog post that the exponential mechanism is \(\varepsilon^2/8\)-CDP.</p>
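<p>As a sanity check, the distribution above is straightforward to sample from. Here is a sketch (our own naming) that normalizes \(e^{\varepsilon h_i}\), using a max-shift so the exponentials stay in floating-point range:</p>

```python
import numpy as np

def exp_mech(h, epsilon, rng):
    """Sample i with probability exp(eps*h[i]) / sum_j exp(eps*h[j])."""
    scores = epsilon * np.asarray(h, dtype=float)
    scores -= scores.max()        # shift so exponentials cannot overflow
    probs = np.exp(scores)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
h = [10, 3, 7, 1]
winner = exp_mech(h, epsilon=0.5, rng=rng)
```

<p>As \(\varepsilon\) grows, the sampled winner concentrates on the true maximizer; as \(\varepsilon \to 0\), the output approaches a uniform choice over \([d]\).</p>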
<p>Hence, to generalize this to top-\(k\) selection, we can simply apply this exponential mechanism iteratively, removing the elements <em>discovered</em> in previous rounds. That is, we write \(M^{(k)}: \mathbb{N}^d \to [d]^k\) as the following for any outcome \( (i_1, i_2, \cdots, i_k) \in [d]^k\),</p>
<p>\[
\mathbb{P}[M^{(k)}(h) = (i_1, i_2, \cdots, i_k)] \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
\]
<a name="eq:peelingEM"></a>
\[\qquad = \frac{e^{ \varepsilon h_{i_1} }}{\sum_{j \in [d] } e^{ \varepsilon h_j } } \cdot \frac{ e^{\varepsilon h_{i_2} } }{\sum_{j\in [d]\setminus \{ i_1\}} e^{\varepsilon h_j } } \cdot \cdots \cdot \frac{ e^{ \varepsilon h_{i_k} } }{\sum_{j\in [d]\setminus \{ i_1, \cdots, i_{k-1}\}} e^{ \varepsilon h_j } }.
\tag{1}
\]</p>
<p>We can then apply composition to conclude that \(M^{(k)}\) is \(k \varepsilon^2/8\)-CDP.</p>
<h2 id="gumbel-noise-and-the-exponential-mechanism">Gumbel Noise and the Exponential Mechanism</h2>
<p>As we discussed in our last post, we can implement the exponential mechanism by adding <a href="https://en.wikipedia.org/wiki/Gumbel_distribution">Gumbel</a> noise to each count and reporting the noisy max element. A Gumbel random variable \(X \sim \text{Gumbel}(\beta) \), parameterized by scale parameter \(\beta>0\), has the following density function
<a name="eq:GumbelDensity"></a>
\[
p(x;\beta) = \frac{1}{\beta} \exp\left( - x/\beta - e^{-x/\beta} \right), \qquad \forall x \in \mathbb{R}.
\tag{2}
\]</p>
<p>Hence, we can write the exponential mechanism in the following way
\[
M^{(1)}(h) = \arg\max \{ h_i + X_i : i \in [d] \}, \qquad \{X_i \} \stackrel{i.i.d.}{\sim} \text{Gumbel}(1/\varepsilon).
\]</p>
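<p>The Gumbel-max implementation is a one-liner on top of NumPy’s Gumbel sampler, and the frequencies it produces can be checked empirically against the exponential mechanism’s probabilities (sketch, our naming):</p>

```python
import numpy as np

def exp_mech_gumbel(h, epsilon, rng):
    """Report-noisy-max with Gumbel(1/epsilon) noise: distributed
    identically to the exponential mechanism (Gumbel-max trick)."""
    h = np.asarray(h, dtype=float)
    return int(np.argmax(h + rng.gumbel(scale=1.0 / epsilon, size=h.shape)))

# Empirical frequencies should match exp(eps*h_i) / sum_j exp(eps*h_j).
rng = np.random.default_rng(0)
h, eps, trials = np.array([2.0, 1.0, 0.0]), 1.0, 20000
freq = np.bincount([exp_mech_gumbel(h, eps, rng) for _ in range(trials)],
                   minlength=3) / trials
target = np.exp(eps * h) / np.exp(eps * h).sum()
```
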
<p>We can then extend this to top-\(k\) by repeatedly adding independent Gumbel noise to each count and removing the discovered element for the next round. However, a change that would significantly improve the running time would be to add Gumbel noise to each count <em>once</em> and then take the elements with the top-\(k\) noisy counts. We would then add only \(d\) noise terms, rather than the \(O(dk)\) noise terms needed to iteratively run \(k\) different exponential mechanisms. The question is, does this one-shot top-\(k\) Gumbel noise mechanism ensure the same level of privacy?</p>
<p>Let’s denote the one-shot Gumbel mechanism as \(\tilde{M}^{(k)}\). At first glance, it does not seem like the one-shot Gumbel mechanism \(\tilde{M}^{(k)}\) should be just as private as the iterative exponential mechanism \(M^{(k)}\), but it turns out they are exactly the same mechanism! The following result is due to <a href="https://arxiv.org/abs/1905.04273" title="David Durfee, Ryan Rogers. Practical Differentially Private Top-k Selection with Pay-what-you-get Composition. NeurIPS 2019"><strong>[DR19]</strong></a>.</p>
<blockquote>
<p><strong>Theorem 1</strong>
For any input vector of counts \(h \in \mathbb{N}^d\), the one-shot Gumbel mechanism \(\tilde{M}^{(k)}(h)\) and iteratively applying the exponential mechanism \(M^{(k)}(h)\) are equal in distribution.</p>
</blockquote>
<p><em>Proof.</em>
Recall the distribution of the iterative exponential mechanism \(M^{(k)}(h)\) from <a href="#eq:peelingEM">(1)</a>.
Now we consider the one-shot Gumbel mechanism \(\tilde{M}^{(k)}(h)\) where we use the density of \(X \sim \) Gumbel\( (1/\varepsilon)\) from <a href="#eq:GumbelDensity">(2)</a>.<br />
\[
\mathbb{P}[\tilde{M}^{(k)}(h) = (i_1, \cdots, i_k)] \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
\]
\[
\qquad = \int_{-\infty}^\infty p(u_1 - h_{i_1}) \int_{-\infty}^{u_1} p(u_2 - h_{i_2}) \cdots \int_{-\infty}^{u_{k-1}} p(u_k - h_{i_k})
\]
<a name="eq:integral"></a>
\[
\qquad \qquad \cdot \prod_{j \in [d] \setminus \{i_1, \cdots, i_k \} } \mathbb{P}[ X < u_k - h_j]du_k \cdots du_2 du_1.
\tag{3}
\]
Note that we have
\[
\mathbb{P}[X < y] = \exp\left( - \exp\left( -\varepsilon y \right) \right).
\]
Let’s focus on the inner integral over \(u_k\) in <a href="#eq:integral">(3)</a>.<br />
\[
\int_{-\infty}^{u_{k-1}} p(u_k - h_{i_k} )\prod_{j \in [d] \setminus \{i_1, \cdots, i_k \} } \mathbb{P}[X < u_k - h_j ]du_k \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
\]
\[
\quad = \int_{-\infty}^{u_{k-1}} \varepsilon \cdot \exp\left( - \varepsilon (u_k - h_{i_k}) - e^{ -\varepsilon (u_k - h_{i_k}) } \right) \cdot \exp\left( -e^{-\varepsilon u_k} \sum_{j \in [d] \setminus \{i_1, \cdots, i_k \} } e^{\varepsilon h_j} \right) du_k
\]
\[
\quad = \varepsilon e^{\varepsilon h_{i_k}} \int_{-\infty}^{u_{k-1}} \exp\left( -\varepsilon u_k - e^{-\varepsilon u_k} \left( e^{\varepsilon h_{i_k}} + \sum_{j \in [d] \setminus \{i_1, \cdots i_k \} } e^{\varepsilon h_j} \right) \right) du_k \qquad
\]
<a name="eq:lastLine"></a>
\[
\qquad = \varepsilon e^{\varepsilon h_{i_k}} \int_{-\infty}^{u_{k-1}} \exp\left( -\varepsilon u_k - e^{-\varepsilon u_k} \left(\sum_{j \in [d] \setminus \{i_1, \cdots i_{k-1} \} } e^{\varepsilon h_j} \right) \right) du_k. \qquad \qquad
\tag{4}
\]
We now integrate with a \(v\)-substitution,
\[
v =e^{-\varepsilon u_{k}} \sum_{j \in [d] \setminus \{i_1, \cdots i_{k-1} \} } e^{\varepsilon h_j}
\]
\[
dv = - \varepsilon \sum_{j \in [d] \setminus \{i_1, \cdots i_{k-1} \} } e^{\varepsilon h_j} \cdot e^{-\varepsilon u_{k}} du_{k}.
\]</p>
<p>Continuing with <a href="#eq:lastLine">(4)</a>, we get
\[
\int_{-\infty}^{u_{k-1}} p(u_k - h_{i_k} )\prod_{j \in [d] \setminus \{i_1, \cdots, i_k \} } \mathbb{P}[X < u_k - h_j ]du_k \qquad \qquad \qquad \qquad \qquad
\]
\[
\qquad = \frac{e^{\varepsilon h_{i_k} }}{\sum_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \}} e^{\varepsilon h_j}} \cdot \exp\left( - e^{-\varepsilon u_{k-1}} \cdot \sum_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \}} e^{\varepsilon h_j} \right)
\]
\[
\qquad = \frac{e^{\varepsilon h_{i_k} }}{\sum_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \}} e^{\varepsilon h_j}} \cdot \prod_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \} } \mathbb{P}[X < u_{k-1} - h_j ] .
\]
Note how this line has the last term in the expression for \(M^{(k)}(h)\) in <a href="#eq:peelingEM">(1)</a>, which is independent of \(u_{k-1}\) and can hence be pulled out of the larger integral in <a href="#eq:integral">(3)</a>. By induction, we have
\[
\mathbb{P}[\tilde{M}^{(k)}(h) = (i_1, \cdots, i_k)] \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad
\]
\[
\qquad = \frac{e^{\varepsilon h_{i_1} }}{\sum_{j \in [d]} e^{\varepsilon h_j}} \cdot \frac{e^{\varepsilon h_{i_2} }}{\sum_{j \in [d] \setminus \{ i_1\}} e^{\varepsilon h_j}} \cdot \cdots \cdot \frac{e^{\varepsilon h_{i_k} }}{\sum_{j \in [d] \setminus \{i_1, \cdots, i_{k-1} \}} e^{\varepsilon h_j}}
\]
\[
\qquad = \mathbb{P}[M^{(k)}(h) =(i_1, \cdots, i_k)].
\] ∎</p>
<p>So that’s great! We can now run the one-shot Gumbel mechanism for top-\(k\) and still get the improved composition bounds of the exponential mechanism. In addition to achieving a better runtime, this analysis can help prove privacy of top-\(k\) DP algorithms over a large domain universe that access only the true top-\(\bar{k}\) items and their counts, where \(\bar{k} > k \); see <a href="https://arxiv.org/abs/1905.04273" title="David Durfee, Ryan Rogers. Practical Differentially Private Top-k Selection with Pay-what-you-get Composition. NeurIPS 2019"><strong>[DR19]</strong></a> for more details.</p>
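<p>A minimal sketch of the one-shot mechanism (names ours): noise every count once with Gumbel\((1/\varepsilon)\) and return the \(k\) largest noisy counts in ranked order. By Theorem 1 this matches \(k\) rounds of peeling, so composition gives \(k\varepsilon^2/8\)-CDP.</p>

```python
import numpy as np

def one_shot_gumbel_topk(h, k, epsilon, rng):
    """Return the indices of the k largest Gumbel(1/epsilon)-noised
    counts, in ranked order; d noise draws total instead of O(dk)."""
    h = np.asarray(h, dtype=float)
    noisy = h + rng.gumbel(scale=1.0 / epsilon, size=h.shape)
    return [int(i) for i in np.argsort(-noisy)[:k]]  # descending order

rng = np.random.default_rng(0)
h = np.array([100, 90, 80, 5, 4, 3])
top3 = one_shot_gumbel_topk(h, k=3, epsilon=1.0, rng=rng)
```

<p>With these (illustrative) counts the gap between the top three elements and the rest is large relative to the noise scale, so the released set almost surely equals the true top-\(3\).</p>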
<h2 id="report-noisy-max-for-dp-top-k">Report Noisy Max for DP Top-\(k\)</h2>
<p>We now turn to comparing this algorithm to some natural alternatives. As we discussed in the last post, there is a family of mechanisms, report noisy max (RNM) mechanisms, that ensure differential privacy for the selection problem, and hence the top-\(1\) problem. We showed that the exponential mechanism is equivalent to RNM with Gumbel noise; there is also RNM with Laplace noise and RNM with Exponential noise, the latter being the recently discovered <em>permute-and-flip</em> mechanism <a href="https://arxiv.org/abs/2010.12603" title="Ryan McKenna, Daniel Sheldon. Permute-and-Flip: A new mechanism for differentially private selection. NeurIPS 2020."><strong>[MS20]</strong></a> <a href="https://arxiv.org/abs/2105.07260" title="Zeyu Ding, Daniel Kifer, Sayed M. Saghaian N. E., Thomas Steinke, Yuxin Wang, Yingtai Xiao, Danfeng Zhang. The Permute-and-Flip Mechanism is Identical to Report-Noisy-Max with Exponential Noise. 2021."><strong>[DKSSWXZ21]</strong></a>.</p>
<p>To then use RNM mechanisms for top-\(k\), we can again iteratively apply them and use composition to get the overall privacy guarantee. However, it turns out that you can also use the Laplace noise version of RNM in one shot <a href="https://arxiv.org/abs/2105.08233" title="Gang Qiao, Weijie J. Su, Li Zhang. Oneshot Differentially Private Top-k Selection. ICML 2021."><strong>[QSZ21]</strong></a>.</p>
<p>We can compare the relative noise that is added to each count in both the Laplace and Gumbel versions. Since <a href="https://arxiv.org/abs/2105.08233" title="Gang Qiao, Weijie J. Su, Li Zhang. Oneshot Differentially Private Top-k Selection. ICML 2021."><strong>[QSZ21]</strong></a> gives their privacy guarantee in terms of approximate \((\varepsilon,\delta )\)-DP, we will now make the comparison there. We first look at the standard deviation for Laplace \( \sigma_{\text{Lap}}\) (using Theorem 2.2 in <a href="https://arxiv.org/abs/2105.08233" title="Gang Qiao, Weijie J. Su, Li Zhang. Oneshot Differentially Private Top-k Selection. ICML 2021."><strong>[QSZ21]</strong></a>).
\[
\sigma_{\text{Lap}} = \frac{8 \sqrt{2k \ln(d/\delta)}}{\varepsilon}.
\]
Note that the one-shot Laplace mechanism returns the counts as well as the indices of the top-\(k\), both of which use Laplace noise with standard deviation \(\sigma_{\text{Lap}}\), so we will also add Laplace noise to the counts of the discovered elements in the Gumbel version. That is, we add Gumbel noise with scale \(\sqrt{k}/\varepsilon’\) for the discovery portion and Laplace noise with scale \(2\sqrt{k}/\varepsilon’\) for obtaining the counts, resulting in noise standard deviations \(\sigma_{\text{Gumb}}’\) and \(\sigma_{\text{Lap}}’\), respectively.<br />
\[
\sigma_{\text{Gumb}}’ = \frac{\pi \sqrt{k} }{\sqrt{6}\varepsilon’}, \qquad
\sigma_{\text{Lap}}’ = \frac{2\sqrt{2k}}{\varepsilon’}.
\]
Recall that adding this scale of Gumbel noise and Laplace noise will ensure \(\tfrac{\varepsilon’^2}{8} \)-CDP each, so combining will ensure \(\tfrac{\varepsilon’^2}{4}\)-CDP. We could also use Gaussian noise to return the counts since we are using CDP, but we will analyze it with Laplace noise for comparison. To ensure \((\varepsilon,\delta)\)-DP, we use the CDP to DP conversion from Lemma 3.5 in <a href="https://arxiv.org/abs/1605.02065" title="Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016."><strong>[BS16]</strong></a> and solve for \(\varepsilon’\). Hence, we get for any \(\delta>0\)
\[
\varepsilon’^2/4 = \left( \sqrt{\ln(1/\delta) + \varepsilon} - \sqrt{\ln(1/\delta)} \right)^2
\]
\[ \implies \varepsilon’ = 2 \sqrt{\ln(1/\delta)} \left( \sqrt{1 + \tfrac{\varepsilon}{\ln(1/\delta)}} - 1 \right).
\]</p>
<p>Let’s consider a typical privacy setting where \(\varepsilon < \ln(1/\delta)\), and use the inequality \(\sqrt{1+x} \geq 1 + x/4\) for \(0<x<1\). Here is a short proof of this inequality:
\[
(1 + x/4)^2 = 1 + x/2 + x^2/16 \leq 1 + x/2 + x/2 = 1 + x.
\]
Note that the privacy guarantee for one-shot Laplace noise only holds when \(\varepsilon < 0.2\) and \(\delta < 0.05\) as stated in Theorem 2.2 in <a href="https://arxiv.org/abs/2105.08233" title="Gang Qiao, Weijie J. Su, Li Zhang. Oneshot Differentially Private Top-k Selection. ICML 2021."><strong>[QSZ21]</strong></a>. In this case, we have
\[
\varepsilon’ \geq 2 \sqrt{\ln(1/\delta)} \left( 1 + 1/4 \cdot \tfrac{\varepsilon}{\ln(1/\delta)} - 1 \right) = 1/2 \cdot \tfrac{\varepsilon}{\sqrt{\ln(1/\delta)}}.
\]
Plugging \(\varepsilon’\) into the standard deviation of Gumbel and Laplace, we get
\[
\sigma_{\text{Gumb}}’ \leq \frac{2\pi\sqrt{k \ln(1/\delta)}}{\sqrt{6}\cdot \varepsilon}, \qquad \sigma_{\text{Lap}}’ \leq \frac{4\sqrt{2k\ln(1/\delta)}}{\varepsilon}.
\]</p>
<p>Putting this together, we can show that we add significantly less noise for the discovery and releasing noisy count phases,
\[
\sigma_{\text{Gumb}}’\leq \sigma_{\text{Lap}}/4 ,\qquad \sigma_{\text{Lap}}’ \leq \sigma_{\text{Lap}}/2.
\]
Note that these bounds can be improved further with similar analysis.</p>
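<p>Plugging illustrative numbers of our own choosing (satisfying \(\varepsilon < 0.2\) and \(\delta < 0.05\)) into the bounds above shows the gap concretely:</p>

```python
import math

eps, delta, k, d = 0.1, 1e-6, 10, 1000   # illustrative parameters
L = math.log(1 / delta)

# One-shot Laplace noise scale from [QSZ21], as quoted above.
sigma_lap_oneshot = 8 * math.sqrt(2 * k * math.log(d / delta)) / eps
# CDP-to-DP conversion for the Gumbel-based mechanism.
eps_prime = 2 * math.sqrt(L) * (math.sqrt(1 + eps / L) - 1)
sigma_gumb = math.pi * math.sqrt(k) / (math.sqrt(6) * eps_prime)  # discovery
sigma_lap = 2 * math.sqrt(2 * k) / eps_prime                      # counts
```

<p>For these parameters the discovery noise is roughly a tenth of the one-shot Laplace scale, consistent with (and a bit better than) the factor-of-4 and factor-of-2 bounds above.</p>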
<p>Although it has not yet been studied whether the permute-and-flip mechanism \(M_{\text{PF}} \) can also ensure DP in one shot by using Exponential noise, we briefly discuss whether it can satisfy bounded range with a similar parameter to the Exponential Mechanism, and hence achieve similar composition bounds. Consider running permute-and-flip on two items \(\{1,2 \}\) with a monotonic quality score \(q: \mathcal{X} \times \{1,2 \} \to \mathbb{R} \) whose sensitivity is 1. Let \(x, x’ \in \mathcal{X} \) be neighbors where
\[
q(x,1) = q(x,2) = 0
\]
\[
q(x’,1) = 0 , \quad q(x’,2) = 1.
\]
Hence, permute-and-flip will return outcome \(1\) or \(2\) with probability \(1/2\) each on dataset \(x\), while on dataset \(x’\) outcome \(1\) occurs with probability \(1/2 \cdot e^{-\varepsilon}\) and outcome \(2\) occurs with probability \( 1/2 + 1/2 \cdot (1 - e^{-\varepsilon})\). We can then lower bound the bounded range parameter \(\alpha\) as
\[
\frac{\mathbb{P}[M_{\text{PF}}(x’) = 2 ] }{\mathbb{P}[M_{\text{PF}}(x) = 2 ]}\leq e^{\alpha}\frac{\mathbb{P}[M_{\text{PF}}(x’) = 1 ]}{\mathbb{P}[M_{\text{PF}}(x) = 1 ]} \implies \alpha \geq \varepsilon + \ln(2 - e^{-\varepsilon} ).
\]
Note that with \(\varepsilon \gg 1\), we get \(\alpha \) close to \(\varepsilon + \ln 2\), essentially the same bounded range parameter as the exponential mechanism. However, with \(\varepsilon< 1\), we get \(\alpha\) close to \(2\varepsilon\). This example provides a lower bound on the BR parameter for permute-and-flip.</p>
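<p>A quick numerical check of this lower bound in the two regimes just discussed (sketch, our naming):</p>

```python
import math

def pf_br_lower_bound(eps):
    """alpha >= eps + ln(2 - e^{-eps}) from the two-item example above."""
    return eps + math.log(2 - math.exp(-eps))

small = pf_br_lower_bound(0.01)   # ~ 2 * eps in the small-eps regime
large = pf_br_lower_bound(10.0)   # ~ eps + ln 2 for large eps
```
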
<h2 id="conclusion">Conclusion</h2>
<p>We have looked at the top-\(k\) selection problem subject to differential privacy and although there are many different mechanisms to use, the exponential mechanism stands out for several reasons:</p>
<ol>
<li>The exponential mechanism is \(\varepsilon\)-DP and \( \varepsilon^2/8\)-CDP and hence gets improved composition.</li>
<li>Iteratively applying the exponential mechanism for top-\(k\) can be implemented by adding Gumbel noise to each count and returning the elements with the top-\(k\) noisy counts in one-shot.</li>
<li>The one-shot Gumbel mechanism returns a ranked list of \(k\) elements, rather than a set of \(k\) elements.</li>
</ol>
David DurfeeRyan RogersMon, 09 Aug 2021 10:00:00 -0700
https://differentialprivacy.org/one-shot-top-k/
https://differentialprivacy.org/one-shot-top-k/A Better Privacy Analysis of the Exponential Mechanism<p>A basic and frequent task in data analysis is <em>selection</em> – given a set of options \(\mathcal{Y}\), output the (approximately) best one, where “best” is defined by some loss function \(\ell : \mathcal{Y} \times \mathcal{X}^n \to \mathbb{R}\) and a dataset \(x \in \mathcal{X}^n\). That is, we want to output some \(y \in \mathcal{Y}\) that approximately minimizes \(\ell(y,x)\). Naturally, we are interested in <em>private selection</em> – i.e., the output should be differentially private in terms of the dataset \(x\).
This post discusses algorithms for private selection – in particular, we give an improved privacy analysis of the popular exponential mechanism.</p>
<h2 id="the-exponential-mechanism">The Exponential Mechanism</h2>
<p>The most well-known algorithm for private selection is the <a href="https://en.wikipedia.org/wiki/Exponential_mechanism_(differential_privacy)"><em>exponential mechanism</em></a> <a href="https://doi.org/10.1109/FOCS.2007.66" title="Frank McSherry, Kunal Talwar. Mechanism Design via Differential Privacy. FOCS 2007."><strong>[MT07]</strong></a>. The exponential mechanism \(M : \mathcal{X}^n \to \mathcal{Y} \) is a randomized algorithm given by \[\forall x \in \mathcal{X}^n ~ \forall y \in \mathcal{Y} ~~~~~ \mathbb{P}[M(x) = y] = \frac{\exp(-\frac{\varepsilon}{2\Delta} \ell(y,x))}{\sum_{y' \in \mathcal{Y}} \exp(-\frac{\varepsilon}{2\Delta} \ell(y',x)) }, \tag{1}\] where \(\Delta\) is the sensitivity of the loss function \(\ell\) given by \[\Delta = \sup_{x,x' \in \mathcal{X}^n : d(x,x') \le 1} \max_{y\in\mathcal{Y}} |\ell(y,x) - \ell(y,x')|,\tag{2}\] where the supremum is taken over all datasets \(x\) and \(x'\) differing on the data of a single individual (which we denote by \(d(x,x')\le 1\)).</p>
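<p>For concreteness, here is a minimal sketch of Equation 1 in Python. The names are ours, and the log-sum-exp shift is a standard numerical-stability trick, not part of the definition:</p>

```python
import numpy as np

def exponential_mechanism_probs(losses, eps, sensitivity):
    """Output distribution of Equation 1: P[M(x)=y] proportional to
    exp(-eps * loss(y, x) / (2 * sensitivity))."""
    logits = -eps * np.asarray(losses, dtype=float) / (2 * sensitivity)
    logits -= logits.max()   # shift for numerical stability; cancels in the ratio
    weights = np.exp(logits)
    return weights / weights.sum()

def exponential_mechanism(losses, eps, sensitivity, rng=None):
    """Sample an index y with the probabilities above."""
    rng = np.random.default_rng() if rng is None else rng
    return int(rng.choice(len(losses), p=exponential_mechanism_probs(losses, eps, sensitivity)))

p = exponential_mechanism_probs([0.0, 1.0], eps=2.0, sensitivity=1.0)
print(p[0] / p[1])  # exp(eps * 1 / (2 * sensitivity)) = e
```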
<p>In terms of utility, we can easily show that <a href="https://arxiv.org/abs/1511.02513" title="Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, Jonathan Ullman. Algorithmic Stability for Adaptive Data Analysis. STOC 2016."><strong>[BNSSSU16]</strong></a> \[\mathbb{E}[\ell(M(x),x)] \le \min_{y \in \mathcal{Y}} \ell(y,x) + \frac{2\Delta}{\varepsilon} \log |\mathcal{Y}|\] for all \(x \in \mathcal{X}^n\) (and we can also give high probability bounds).</p>
<p>It is easy to show that the exponential mechanism satisfies \(\varepsilon\)-differential privacy.
But there is more to this story! We’re going to look at a more refined privacy analysis.</p>
<h2 id="bounded-range">Bounded Range</h2>
<p>The privacy guarantee of the exponential mechanism is more precisely characterized by <em>bounded range</em>. This was observed and defined by David Durfee and Ryan Rogers <a href="https://arxiv.org/abs/1905.04273" title="David Durfee, Ryan Rogers. Practical Differentially Private Top-k Selection with Pay-what-you-get Composition. NeurIPS 2019"><strong>[DR19]</strong></a> and further analyzed later <a href="https://arxiv.org/abs/1909.13830" title="Jinshuo Dong, David Durfee, Ryan Rogers. Optimal Differential Privacy Composition for Exponential Mechanisms. ICML 2020."><strong>[DDR20]</strong></a>.</p>
<blockquote>
<p><strong>Definition 1 (Bounded Range).</strong><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
A randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\eta\)-bounded range if, for all pairs of inputs \(x, x' \in \mathcal{X}^n\) differing only on the data of a single individual, there exists some \(t \in \mathbb{R}\) such that \[\forall y \in \mathcal{Y} ~~~~~ \log\left(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x')=y]}\right) \in [t, t+\eta].\] Here \(t\) may depend on the pair of input datasets \(x,x'\), but not on the output \(y\).</p>
</blockquote>
<p>To interpret this definition, we <a href="/flavoursofdelta/">recall the definition of the privacy loss random variable</a>: Define \(f : \mathcal{Y} \to \mathbb{R}\) by \[f(y) = \log\left(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x')=y]}\right).\] Then the privacy loss random variable \(Z \gets \mathsf{PrivLoss}(M(x)\|M(x'))\) is given by \(Z = f(M(x))\).</p>
<p>Pure \(\varepsilon\)-differential privacy is equivalent to demanding that the privacy loss is bounded by \(\varepsilon\) – i.e., \(\mathbb{P}[|Z|\le\varepsilon]=1\). Approximate \((\varepsilon,\delta)\)-differential privacy is, roughly, equivalent to demanding that \(\mathbb{P}[Z\le\varepsilon]\ge1-\delta\).<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
<p>Now \(\eta\)-bounded range is simply demanding that the privacy loss \(Z\) is supported on some interval of length \(\eta\). This interval \([t,t+\eta]\) may depend on the pair \(x,x'\).</p>
<p>Bounded range and pure differential privacy are equivalent up to a factor of 2 in the parameters:</p>
<blockquote>
<p><strong>Lemma 2 (Bounded Range versus Pure Differential Privacy).</strong></p>
<ul>
<li>\(\varepsilon\)-differential privacy implies \(\eta\)-bounded range with \(\eta \le 2\varepsilon\).</li>
<li>\(\eta\)-bounded range implies \(\varepsilon\)-differential privacy with \(\varepsilon \le \eta\).</li>
</ul>
</blockquote>
<p><em>Proof.</em> The first part of the equivalence follows from the fact that pure \(\varepsilon\)-differential privacy implies the privacy loss is supported on the interval \([-\varepsilon,\varepsilon]\). Thus, if we set \(t=-\varepsilon\) and \(\eta=2\varepsilon\), then \([t,t+\eta] = [-\varepsilon,\varepsilon]\).
The second part follows from the fact that the support \([t,t+\eta]\) of the privacy loss must straddle \(0\). That is, the privacy loss cannot be always positive nor always negative: if \(f(y)>0\) for all \(y\), then \(\mathbb{P}[M(x)=y]>\mathbb{P}[M(x')=y]\) for all \(y\), contradicting the fact that \(\sum_{y \in \mathcal{Y}} \mathbb{P}[M(x)=y] = 1 = \sum_{y \in \mathcal{Y}} \mathbb{P}[M(x')=y]\); the case \(f(y)<0\) for all \(y\) is symmetric. Hence \(0 \in [t,t+\eta]\) and, therefore, \([t,t+\eta] \subseteq [-\eta,\eta]\). ∎</p>
<p>OK, back to the exponential mechanism:</p>
<blockquote>
<p><strong>Lemma 3 (The Exponential Mechanism is Bounded Range).</strong>
The exponential mechanism (given in Equation 1 above) satisfies \(\varepsilon\)-bounded range.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
</blockquote>
<p><em>Proof.</em>
We have \[e^{f(y)} = \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x')=y]} = \frac{\exp(-\frac{\varepsilon}{2\Delta}\ell(y,x))}{\exp(-\frac{\varepsilon}{2\Delta}\ell(y,x'))} \cdot \frac{\sum_{y'} \exp(-\frac{\varepsilon}{2\Delta} \ell(y',x'))}{\sum_{y'} \exp(-\frac{\varepsilon}{2\Delta} \ell(y',x))}.\]
Setting \(t = \log\left(\frac{\sum_{y'} \exp(-\frac{\varepsilon}{2\Delta} \ell(y',x'))}{\sum_{y'} \exp(-\frac{\varepsilon}{2\Delta} \ell(y',x))}\right) - \frac{\varepsilon}{2}\), we have \[ f(y) = \frac{\varepsilon}{2\Delta} (\ell(y,x')-\ell(y,x)+\Delta) + t.\]
By the definition of sensitivity (given in Equation 2), we have \( 0 \le \ell(y,x')-\ell(y,x)+\Delta \le 2\Delta\), whence \(t \le f(y) \le t + \varepsilon\). ∎</p>
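<p>A quick numerical check of Lemma 3, with made-up loss values: for any pair of loss vectors whose entrywise difference is at most the sensitivity, the privacy loss \(f(y)\) should span an interval of length at most \(\varepsilon\).</p>

```python
import numpy as np

eps, Delta = 1.0, 1.0
rng = np.random.default_rng(0)
loss_x = rng.uniform(0.0, 10.0, size=50)
# Neighboring dataset: every loss moves by at most the sensitivity Delta.
loss_xp = loss_x + rng.uniform(-Delta, Delta, size=50)

def em_probs(losses):
    """Exponential-mechanism output distribution (Equation 1)."""
    w = np.exp(-eps * losses / (2 * Delta))
    return w / w.sum()

f = np.log(em_probs(loss_x) / em_probs(loss_xp))  # privacy loss f(y)
print(f.max() - f.min())  # at most eps = 1.0, per Lemma 3
```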
<p>Bounded range is not really a useful privacy definition on its own. Thus we’re going to relate it to a relaxed version of differential privacy next.</p>
<h2 id="concentrated-differential-privacy">Concentrated Differential Privacy</h2>
<p>Concentrated differential privacy <a href="https://arxiv.org/abs/1605.02065" title="Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016."><strong>[BS16]</strong></a> and its variants <a href="https://arxiv.org/abs/1603.01887" title="Cynthia Dwork, Guy N. Rothblum. Concentrated Differential Privacy. 2016."><strong>[DR16]</strong></a> <a href="https://arxiv.org/abs/1702.07476" title="Ilya Mironov. Rényi Differential Privacy. CCS 2017."><strong>[M17]</strong></a> are relaxations of pure differential privacy with many nice properties. In particular, it composes very cleanly.</p>
<blockquote>
<p><strong>Definition 4 (Concentrated Differential Privacy).</strong>
A randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\rho\)-concentrated differential privacy if, for all pairs of inputs \(x, x' \in \mathcal{X}^n\) differing only on the data of a single individual,
\[\forall \lambda > 0 ~~~~~ \mathbb{E}[\exp( \lambda Z)] \le \exp(\lambda(\lambda+1)\rho),\tag{3}\]
where \(Z \gets \mathsf{PrivLoss}(M(x)\|M(x'))\) is the privacy loss random variable.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>
</blockquote>
<p>Intuitively, concentrated differential privacy requires that the privacy loss is subgaussian. Specifically, the bound on the moment generating function of \(\rho\)-concentrated differential privacy is tight if the privacy loss \(Z\) follows the distribution \(\mathcal{N}(\rho,2\rho)\). Indeed, the privacy loss random variable of the Gaussian mechanism has such a distribution.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></p>
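<p>The tightness claim is easy to verify: plugging the normal MGF \(\mathbb{E}[e^{\lambda Z}] = e^{\lambda\mu + \lambda^2\sigma^2/2}\) with \(\mu=\rho\) and \(\sigma^2=2\rho\) into Equation 3 gives equality. A short check (function names ours):</p>

```python
import math

def gaussian_privloss_mgf(lam, rho):
    """MGF of Z ~ N(rho, 2*rho): exp(lam*mu + lam^2 * var / 2)."""
    return math.exp(lam * rho + lam ** 2 * (2 * rho) / 2)

def zcdp_bound(lam, rho):
    """Right-hand side of Equation 3."""
    return math.exp(lam * (lam + 1) * rho)

# The Gaussian privacy loss makes Equation 3 an equality for every lambda.
for lam in (0.5, 1.0, 2.0, 7.0):
    print(math.isclose(gaussian_privloss_mgf(lam, 0.3), zcdp_bound(lam, 0.3)))  # True
```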
<p>OK, back to the exponential mechanism:
We know that \(\varepsilon\)-differential privacy implies \(\frac12 \varepsilon^2\)-concentrated differential privacy <a href="https://arxiv.org/abs/1605.02065" title="Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016."><strong>[BS16]</strong></a>.
This, of course, applies to the exponential mechanism. A cool fact – that we want to draw more attention to – is that we can do better!
Specifically, \(\eta\)-bounded range implies \(\frac18 \eta^2\)-concentrated differential privacy <a href="https://arxiv.org/abs/2004.07223" title="Mark Cesar, Ryan Rogers. Bounding, Concentrating, and Truncating: Unifying Privacy Loss Composition for Data Analytics. ALT 2021."><strong>[CR21]</strong></a>.
What follows is a proof of this fact following that of Mark Cesar and Ryan Rogers, but with some simplification.</p>
<blockquote>
<p><strong>Theorem 5 (Bounded Range implies Concentrated Differential Privacy).</strong>
If \(M\) is \(\eta\)-bounded range, then it is \(\frac18\eta^2\)-concentrated differentially private.</p>
</blockquote>
<p><em>Proof.</em>
Fix datasets \(x,x' \in \mathcal{X}^n\) differing on a single individual’s data.
Let \(Z \gets \mathsf{PrivLoss}(M(x)\|M(x'))\) be the privacy loss random variable of the mechanism \(M\) on this pair of datasets.
By the definition of bounded range (Definition 1), there exists some \(t \in \mathbb{R}\) such that \(Z \in [t, t+\eta]\) with probability 1.
Now we employ <a href="https://en.wikipedia.org/wiki/Hoeffding%27s_lemma">Hoeffding’s Lemma</a> <a href="https://doi.org/10.1080%2F01621459.1963.10500830" title="Wassily Hoeffding. Probability inequalities for sums of bounded random variables. JASA 1963."><strong>[H63]</strong></a>:</p>
<blockquote>
<p><strong>Lemma 6 (Hoeffding’s Lemma).</strong>
Let \(X\) be a random variable supported on the interval \([a,b]\). Then, for all \(\lambda \in \mathbb{R}\), we have \[\mathbb{E}[\exp(\lambda X)] \le \exp \left( \mathbb{E}[X] \cdot \lambda + \frac{(b-a)^2}{8} \cdot \lambda^2 \right).\]</p>
</blockquote>
<p>Applying the lemma to the privacy loss gives \[\forall \lambda \in \mathbb{R} ~~~~~ \mathbb{E}[\exp(\lambda Z)] \le \exp \left( \mathbb{E}[Z] \cdot \lambda + \frac{\eta^2}{8} \cdot \lambda^2 \right).\]
The only remaining thing we need to show is that \(\mathbb{E}[Z] \le \frac18 \eta^2\).<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></p>
<p>If we set \(\lambda = -1 \), then we get \( \mathbb{E}[\exp( - Z)] \le \exp \left( -\mathbb{E}[Z] + \frac{\eta^2}{8} \right)\), which rearranges to \(\mathbb{E}[Z] \le \frac18 \eta^2 - \log \mathbb{E}[\exp( - Z)]\).
Now we have \[ \mathbb{E}[\exp( - Z)] \!=\! \sum_y \mathbb{P}[M(x)\!=\!y] \exp(-f(y)) \!=\! \sum_y \mathbb{P}[M(x)\!=\!y] \!\cdot\! \frac{\mathbb{P}[M(x')\!=\!y]}{\mathbb{P}[M(x)\!=\!y]} \!=\! 1.\]
∎</p>
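<p>The final identity \(\mathbb{E}[\exp(-Z)]=1\) holds for any pair of distributions on a common support, which is easy to check numerically (the random distributions below are arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.uniform(0.1, 1.0, size=8); p /= p.sum()   # law of M(x)
q = rng.uniform(0.1, 1.0, size=8); q /= q.sum()   # law of M(x')
z = np.log(p / q)                                 # privacy-loss values f(y)
expectation = float(np.sum(p * np.exp(-z)))       # E[exp(-Z)] under Y ~ p
print(expectation)  # 1.0 up to floating-point error
```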
<p>This brings us to the TL;DR of this post:</p>
<blockquote>
<p><strong>Corollary 7.</strong> The exponential mechanism (given by Equation 1) is \(\frac18 \varepsilon^2\)-concentrated differentially private.</p>
</blockquote>
<p>This is great news. The standard analysis only gives \(\frac12 \varepsilon^2\)-concentrated differential privacy. Constants matter when applying differential privacy, and we save a factor of 4 in the concentrated differential privacy analysis of the exponential mechanism for free with this improved analysis.</p>
<p>Combining Lemma 2 with Theorem 5 also gives a simpler proof of the conversion from pure differential privacy to concentrated differential privacy <a href="https://arxiv.org/abs/1605.02065" title="Mark Bun, Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. TCC 2016."><strong>[BS16]</strong></a>:</p>
<blockquote>
<p><strong>Corollary 8.</strong> \(\varepsilon\)-differential privacy implies \(\frac12 \varepsilon^2\)-concentrated differential privacy.</p>
</blockquote>
<h2 id="beyond-the-exponential-mechanism">Beyond the Exponential Mechanism</h2>
<p>The exponential mechanism is not the only algorithm for private selection. A closely-related algorithm is <em>report noisy max/min</em>:<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup> Draw independent noise \(\xi_y\) from some distribution for each \(y \in \mathcal{Y}\) then output \[M(x) = \underset{y \in \mathcal{Y}}{\mathrm{argmin}} ~ \ell(y,x) - \xi_y.\]</p>
<p>If the noise distribution is an appropriate <a href="https://en.wikipedia.org/wiki/Gumbel_distribution">Gumbel distribution</a>, then report noisy max is exactly the exponential mechanism. (This equivalence is known as the “Gumbel max trick.”)</p>
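<p>The Gumbel max trick is easy to see empirically. The sketch below compares the exponential mechanism's exact probabilities against report noisy min with Gumbel noise of scale \(2\Delta/\varepsilon\), which is the scale that matches Equation 1:</p>

```python
import numpy as np

eps, Delta = 1.0, 1.0
losses = np.array([0.0, 1.0, 3.0])
rng = np.random.default_rng(0)

# Exact exponential-mechanism distribution (Equation 1).
w = np.exp(-eps * losses / (2 * Delta))
p_exact = w / w.sum()

# Report noisy min: argmin of loss minus Gumbel(scale = 2*Delta/eps) noise.
n = 200_000
noise = rng.gumbel(scale=2 * Delta / eps, size=(n, len(losses)))
winners = np.argmin(losses - noise, axis=1)
p_empirical = np.bincount(winners, minlength=len(losses)) / n

print(np.abs(p_empirical - p_exact).max())  # small, shrinking as n grows
```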
<p>We can also use the Laplace distribution or the exponential distribution. Report noisy max with the exponential distribution is equivalent to the <em>permute and flip</em> algorithm <a href="https://arxiv.org/abs/2010.12603" title="Ryan McKenna, Daniel Sheldon. Permute-and-Flip: A new mechanism for differentially private selection. NeurIPS 2020."><strong>[MS20]</strong></a> <a href="https://arxiv.org/abs/2105.07260" title="Zeyu Ding, Daniel Kifer, Sayed M. Saghaian N. E., Thomas Steinke, Yuxin Wang, Yingtai Xiao, Danfeng Zhang. The Permute-and-Flip Mechanism is Identical to Report-Noisy-Max with Exponential Noise. 2021."><strong>[DKSSWXZ21]</strong></a>. However, these algorithms don’t enjoy the same improved bounded range and concentrated differential privacy guarantees as the exponential mechanism.</p>
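<p>For comparison, report noisy min with exponential noise (which, per [DKSSWXZ21], is distributionally identical to permute-and-flip) can be sketched the same way. The scale \(2\Delta/\varepsilon\) below is our choice, mirroring the exponential-mechanism scaling, and the function name is ours:</p>

```python
import numpy as np

def report_noisy_min_exponential(losses, eps, sensitivity, rng=None):
    """Report noisy min with Exponential(scale = 2*sensitivity/eps) noise:
    a sketch of the mechanism shown identical to permute-and-flip in
    [DKSSWXZ21]."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.exponential(scale=2 * sensitivity / eps, size=len(losses))
    return int(np.argmin(np.asarray(losses, dtype=float) - noise))
```

<p>When the loss gaps are large relative to the noise scale, the true minimizer is returned with overwhelming probability.</p>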
<p>There are also other variants of the selection problem. For example, in some cases we can assume that only a few options have low loss and the rest of the options have high loss – i.e., there is a gap between the minimum loss and the second-lowest loss (or, more generally, the \(k\)-th lowest loss). In this case there are algorithms that attain better accuracy than the exponential mechanism under relaxed privacy definitions <a href="https://arxiv.org/abs/1409.2177" title="Kamalika Chaudhuri, Daniel Hsu, Shuang Song. The Large Margin Mechanism for Differentially Private Maximization. NIPS 2014."><strong>[CHS14]</strong></a> <a href="https://dl.acm.org/doi/10.1145/3188745.3188946" title=" Mark Bun, Cynthia Dwork, Guy N. Rothblum, Thomas Steinke. Composable and versatile privacy via truncated CDP. STOC 2018."><strong>[BDRS18]</strong></a> <a href="https://arxiv.org/abs/1905.13229" title="Mark Bun, Gautam Kamath, Thomas Steinke, Zhiwei Steven Wu. Private Hypothesis Selection. NeurIPS 2019."><strong>[BKSW19]</strong></a>.</p>
<p>There are a lot of interesting aspects of private selection, including questions for further research! We hope to have further posts about some of these topics.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>For simplicity, we restrict our discussion here to finite sets of outputs, although the definitions, algorithms, and results can be extended to infinite sets. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>To be more precise, \((\varepsilon,\delta)\)-differential privacy is equivalent to demanding that \(\mathbb{E}[\max\{0,1-\exp(\varepsilon-Z)\}]\le\delta\) <a href="https://arxiv.org/abs/2004.00010" title="Clément L. Canonne, Gautam Kamath, Thomas Steinke. The Discrete Gaussian for Differential Privacy. NeurIPS 2020."><strong>[CKS20]</strong></a>. (To be completely precise, we must appropriately deal with the \(Z=\infty\) case, which we ignore in this discussion for simplicity.) <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>This proof actually gives <a href="https://dongjs.github.io/2020/02/10/ExpMech.html">a slightly stronger result</a>: We can replace the sensitivity \(\Delta\) (defined in Equation 2) by half the range \[\hat\Delta = \frac12 \sup_{x,x' \in \mathcal{X}^n : d(x,x') \le 1} \left( \max_{\overline{y}\in\mathcal{Y}} \left( \ell(\overline{y},x) - \ell(\overline{y},x') \right) - \min_{\underline{y}\in\mathcal{Y}} \left( \ell(\underline{y},x) - \ell(\underline{y},x') \right) \right).\] We always have \(\hat\Delta \le \Delta\) but it is possible that \(\hat\Delta < \Delta\) and the privacy analysis of the exponential mechanism still works if we replace \(\Delta\) by \(\hat\Delta\). <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Equivalently, a randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\rho\)-concentrated differential privacy if, for all pairs of inputs \(x, x' \in \mathcal{X}^n\) differing only on the data of a single individual, \[\forall \lambda > 0 ~~~~~ \mathrm{D}_{\lambda+1}(M(x)\|M(x')) \le \lambda(\lambda+1)\rho,\] where \(\mathrm{D}_{\lambda+1}(M(x)\|M(x'))\) is the order \(\lambda+1\) Rényi divergence of \(M(x)\) from \(M(x')\). <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>To be precise, if \(M(x) = q(x) + \mathcal{N}(0,\sigma^2I)\), then \(M : \mathcal{X}^n \to \mathbb{R}^d\) satisfies \(\frac{\Delta_2^2}{2\sigma^2}\)-concentrated differential privacy, where \(\Delta_2 = \sup_{x,x'\in\mathcal{X}^n : d(x,x')\le1} \|q(x)-q(x')\|_2\) is the 2-norm sensitivity of \(q:\mathcal{X}^n \to \mathbb{R}^d\). Furthermore, the privacy loss of the Gaussian mechanism is itself a Gaussian and it makes the inequality defining concentrated differential privacy (Equation 3) an equality for all \(\lambda\). <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Note that the expectation of the privacy loss is simply the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL divergence</a>: \(\mathbb{E}[Z] = \mathrm{D}_1( M(x) \| M(x') )\). <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>We have presented selection here in terms of minimization, but most of the literature is in terms of maximization. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Ryan RogersThomas SteinkeMon, 12 Jul 2021 10:00:00 -0700
https://differentialprivacy.org/exponential-mechanism-bounded-range/
Open Problem - Optimal Query Release for Pure Differential Privacy<p>Releasing large sets of statistical queries is a centerpiece of the theory of differential privacy. Here, we are given a <em>dataset</em> \(x = (x_1,\dots,x_n) \in [T]^n\), and a set of <em>statistical queries</em> \(f_1,\dots,f_k\), where each query is defined by some bounded function \(f_j : [T] \to [-1,1]\), and (abusing notation) is defined as
\[
f_j(x) = \frac{1}{n} \sum_{i=1}^{n} f_j(x_i).
\]
We use \(f(x) = (f_1(x),\dots,f_k(x))\) to denote the vector consisting of the true answers to all these queries.
Our goal is to design an \((\varepsilon, \delta)\)-differentially private algorithm \(M\) that takes a dataset \(x\in [T]^n\) and outputs a random vector \(M(x)\in \mathbb{R}^k\) such that \(\| M(x) - f(x) \|\) is small in expectation for some norm \(\|\cdot\|\). Usually algorithms for this problem also give high probability bounds on the error, but we focus on expected error for simplicity.</p>
<p>This problem has been studied for both <em>pure differential privacy</em> (\(\delta = 0\)) and <em>approximate differential privacy</em> (\(\delta > 0\)), and for both \(\ell_\infty\)-error
\[
\mathbb{E}( \| M(x) - f(x)\|_{\infty} ) \leq \alpha,
\]
and \(\ell_2\)-error
\[
\mathbb{E}( \| M(x) - f(x)\|_{2} ) \leq \alpha k^{1/2},
\]
giving four variants of the problem. By now we know tight worst-case upper and lower bounds for two of these variants, and nearly tight bounds (up to logarithmic factors) for a third. The tightest known upper bounds are given in the following table.</p>
<table>
<tbody>
<tr>
<td> </td>
<td>Pure DP</td>
<td>Approx DP</td>
</tr>
<tr>
<td>\( \ell_2 \)<br />error</td>
<td>\( \alpha \lesssim \left(\frac{\log^2 k ~\cdot~ \log^{3/2}T}{\varepsilon n} \right)^{1/2} \) <br /> [<a href="https://arxiv.org/abs/1212.0297">NTZ13</a>]</td>
<td>\( \alpha \lesssim \left(\frac{\log^{1/2} T}{\varepsilon n} \right)^{1/2} \) <br /> [<a href="https://guyrothblum.files.wordpress.com/2014/11/drv10.pdf">DRV10</a>]</td>
</tr>
<tr>
<td>\( \ell_\infty \)<br />error</td>
<td>\( \alpha \lesssim \left(\frac{\log k ~\cdot~ \log T}{\varepsilon n} \right)^{1/3} \) <br /> [<a href="https://arxiv.org/abs/1109.2229">BLR13</a>]</td>
<td>\( \alpha \lesssim \left(\frac{\log k ~\cdot~ \log^{1/2} T}{\varepsilon n} \right)^{1/2} \) <br /> [<a href="https://guyrothblum.files.wordpress.com/2014/11/hr10.pdf">HR10</a>, <a href="https://arxiv.org/abs/1107.3731">GRU12</a>]</td>
</tr>
</tbody>
</table>
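<p>To get a feel for the table, one can evaluate the four bounds (dropping the constants hidden by \(\lesssim\)) at sample parameters:</p>

```python
import math

def query_release_bounds(k, T, n, eps):
    """The four upper bounds from the table above, hidden constants dropped."""
    lk, lT = math.log(k), math.log(T)
    return {
        "pure_l2":     (lk ** 2 * lT ** 1.5 / (eps * n)) ** 0.5,   # [NTZ13]
        "approx_l2":   (lT ** 0.5 / (eps * n)) ** 0.5,             # [DRV10]
        "pure_linf":   (lk * lT / (eps * n)) ** (1 / 3),           # [BLR13]
        "approx_linf": (lk * lT ** 0.5 / (eps * n)) ** 0.5,        # [HR10, GRU12]
    }

b = query_release_bounds(k=10 ** 6, T=10 ** 6, n=10 ** 8, eps=1.0)
print(b)  # approximate DP beats pure DP in both norms at these parameters
```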
<p>The bounds for approximate DP are known to be tight [<a href="https://arxiv.org/abs/1311.3158">BUV14</a>]. Our two open problems both involve improving the best known upper bounds for pure differential privacy.</p>
<blockquote>
<p><b>Open Problem 1:</b> What is the best possible \(\ell_\infty\)-error for answering a worst-case set of \(k\) statistical queries over a domain of size \(T\) subject to \((\varepsilon,0)\)-differential privacy?</p>
</blockquote>
<p>We conjecture that the known upper bound in the table can be improved to
\[
\alpha = \left(\frac{\log k \cdot \log T}{\varepsilon n} \right)^{1/2},
\]
which is known to be the best possible [<a href="https://dataspace.princeton.edu/handle/88435/dsp01vq27zn422">Har11</a>, Theorem 4.5.1].</p>
<blockquote>
<p><b>Open Problem 2:</b> What is the best possible \(\ell_2\)-error for answering a worst-case set of \(k\) statistical queries over a domain of size \(T\) subject to \((\varepsilon,0)\)-differential privacy?</p>
</blockquote>
<p>We conjecture that the upper bound can be improved to
\[
\alpha = \left(\frac{\log T}{\varepsilon n} \right)^{1/2}.
\]
The construction used in [<a href="https://dataspace.princeton.edu/handle/88435/dsp01vq27zn422">Har11</a>, Theorem 4.5.1] can be analyzed to show this bound would be tight. Note, in particular, that this conjecture implies that the tight upper bound has no dependence on the number of queries, similarly to the case of \(\ell_2\) error and approximate DP.</p>
Sasho NikolovJonathan UllmanWed, 07 Jul 2021 13:45:00 -0400
https://differentialprivacy.org/open-problem-optimal-query-release/