Differential Privacy

Tight RDP & zCDP Bounds from Pure DP

Mon, 27 May 2024 10:00:00 -0700

There are multiple ways to quantify differential privacy, including pure DP [DMNS06], approximate DP [DKMMN06], Concentrated DP [DR16,BS16], Rényi DP [M17], Gaussian DP [DRS19], & function-DP [DRS19]. Fortunately, these definitions are similar enough that we can convert between most of them (with some loss in parameters).

In this post, we consider converting from pure DP to Rényi DP and Concentrated DP. In particular, we will provide optimal results, which are an improvement on what is currently in the literature. But first, let’s recap the relevant definitions.

Definitions: Pure DP, Rényi DP, & zCDP

For notational simplicity, we will assume the output space of the algorithms is discrete and that the algorithms’ output distributions have full support.¹

Definition 1 (Pure DP): A randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\varepsilon\)-differential privacy if, for all pairs of inputs \(x, x’ \in \mathcal{X}^n\) differing only on the data of a single individual, we have \[\forall y \in \mathcal{Y} ~~~~~ \log\left(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]}\right) \le \varepsilon.\]

Pure DP is the simplest (and first) definition and is very convenient for analysis. Pure DP can also be called pointwise DP because the guarantee holds for all points \(y\), whereas all the other definitions either bound some quantity averaged over \(y\) or quantify over sets \(S \subseteq \mathcal{Y}\).

Definition 2 (Rényi DP): A randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \((\alpha,\widehat\varepsilon)\)-Rényi differential privacy if, for all pairs of inputs \(x, x’ \in \mathcal{X}^n\) differing only on the data of a single individual, we have \[ \frac{1}{\alpha-1} \log \left( \underset{Y \gets M(x’)}{\mathbb{E}}\left[ \left( \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \right)^\alpha \right] \right) \le \widehat\varepsilon.\]

Rényi DP is a more flexible definition than pure DP. But this flexibility comes at the cost of complexity. The definition has two parameters, but we can usually trade off these parameters. Thus it is often better to think of it as being parameterized by a function \(\widehat\varepsilon(\alpha)\), which gives us a \((\alpha,\widehat\varepsilon(\alpha))\)-RDP bound for all \(\alpha>1\) simultaneously. However, in many cases – such as the Gaussian mechanism – the function is linear, or can be bounded by a linear function.

Definition 3 (zero-Concentrated DP (zCDP)): A randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\rho\)-zCDP if, for all pairs of inputs \(x, x’ \in \mathcal{X}^n\) differing only on the data of a single individual, we have \[ \forall \alpha > 1 ~~~~~ \frac{1}{\alpha-1} \log \left( \underset{Y \gets M(x’)}{\mathbb{E}}\left[ \left( \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \right)^\alpha \right] \right) \le \alpha\rho.\]

This definition is equivalent to satisfying \((\alpha,\rho\alpha)\)-RDP for all \(\alpha>1\); zCDP can be thought of as a single-parameter version of RDP, which gives us many of the benefits of RDP without the complexity.

Converting Pure DP to Rényi DP

It is immediate from the definitions that \(\varepsilon\)-DP implies \((\alpha,\varepsilon)\)-RDP for all \(\alpha>1\).² This is just saying that the average value is at most the maximum value. We can do better than this:

Theorem 4 (Pure DP to Rényi DP): Let \(M : \mathcal{X}^n \to \mathcal{Y}\) be a randomized algorithm satisfying \(\varepsilon\)-differential privacy. Then \(M\) satisfies \((\alpha,\widehat\varepsilon(\alpha))\)-Rényi DP for all \(\alpha>1\), where \[ \widehat\varepsilon(\alpha) = \frac{1}{\alpha-1} \log \left( \frac{1}{e^\varepsilon+1} e^{\alpha \varepsilon} + \frac{e^\varepsilon}{e^\varepsilon+1} e^{-\alpha \varepsilon} \right) \]\[ = \varepsilon - \frac{1}{\alpha-1} \log \left( \frac{1+e^{-\varepsilon}}{1 + e^{-(2\alpha-1)\varepsilon}} \right). \] Furthermore, this bound is tight.
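As a quick numerical illustration, both expressions in Theorem 4 can be evaluated as in the following minimal sketch. The function names are illustrative choices, not part of the original post; the second function uses `log1p` so that it stays finite even when \(\alpha\varepsilon\) is large.

```python
import math


def rdp_from_pure_dp_direct(eps: float, alpha: float) -> float:
    """First expression in Theorem 4; exp(alpha * eps) may overflow for large inputs."""
    w = 1.0 / (math.exp(eps) + 1.0)  # weight 1 / (e^eps + 1)
    moment = w * math.exp(alpha * eps) + (1.0 - w) * math.exp(-alpha * eps)
    return math.log(moment) / (alpha - 1.0)


def rdp_from_pure_dp_stable(eps: float, alpha: float) -> float:
    """Second expression in Theorem 4, written with log1p for numerical stability."""
    num = math.log1p(math.exp(-eps))
    den = math.log1p(math.exp(-(2.0 * alpha - 1.0) * eps))
    return eps - (num - den) / (alpha - 1.0)
```

For moderate values of \(\alpha\) and \(\varepsilon\) the two functions should agree up to floating-point error, and the result never exceeds the trivial bound \(\varepsilon\).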

Proof.³ Fix neighbouring inputs \(x, x’ \in \mathcal{X}^n\) and fix \(\alpha>1\).

First note that this bound is tight when \(M\) corresponds to randomized response. That is, if \(M(x) = \mathsf{Bernoulli}(\tfrac{e^\varepsilon}{e^\varepsilon+1})\) and \(M(x’) = \mathsf{Bernoulli}(\tfrac{1}{e^\varepsilon+1})\), then the expression in the theorem statement is simply the expression in the definition of Rényi DP. Since this is consistent with \(M\) satisfying \(\varepsilon\)-DP, this proves tightness of the result. To prove the result it only remains to show that randomized response is indeed the worst case \(M\).

We make two additional observations: (1) The definition of pure DP implies \( \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} \le e^\varepsilon \) for all \(y \in \mathcal{Y}\). But the definition of pure DP is symmetric in \(x\) and \(x’\), so we can swap them and obtain a two-sided bound: \[ \forall y \in \mathcal{Y} ~~~~~ e^{-\varepsilon} \le \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} \le e^\varepsilon.\] (2) Since \(\sum_y \mathbb{P}[M(x)=y] = 1\), we have \[ \underset{Y \gets M(x’)}{\mathbb{E}}\left[ \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \right] = \sum_y \mathbb{P}[M(x’)=y] \cdot \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} = 1. \]

Now we define a randomized rounding function \(A : [e^{-\varepsilon},e^\varepsilon] \to \{e^{-\varepsilon},e^\varepsilon\}\) by \(\mathbb{E}_A [A(z)] = z \). That is, for all \( z \in [e^{-\varepsilon},e^\varepsilon] \), we have \[\underset{A}{\mathbb{P}}[A(z)=e^\varepsilon]=\frac{z-e^{-\varepsilon}}{e^\varepsilon-e^{-\varepsilon}} ~~~ \text{ and } ~~~ \underset{A}{\mathbb{P}}[A(z)=e^{-\varepsilon}]=\frac{e^\varepsilon-z}{e^\varepsilon-e^{-\varepsilon}}.\] Since \( v \mapsto v^\alpha \) is convex, by Jensen’s inequality, for all \( z \in [e^{-\varepsilon},e^\varepsilon] \), we have \[z^\alpha = \mathbb{E}_A[A(z)]^\alpha \le \mathbb{E}_A[A(z)^\alpha] = \frac{z-e^{-\varepsilon}}{e^\varepsilon-e^{-\varepsilon}} \cdot e^{\varepsilon\alpha} + \frac{e^\varepsilon-z}{e^\varepsilon-e^{-\varepsilon}} e^{-\alpha\varepsilon}. \] Applying this inequality to the quantity of interest with \(z = \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \), we get \[ \underset{Y \gets M(x’)}{\mathbb{E}}\left[ \left( \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \right)^\alpha \right] \le \underset{Y \gets M(x’) }{\mathbb{E}}\left[ \frac{\frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]}-e^{-\varepsilon}}{e^\varepsilon-e^{-\varepsilon}} \cdot e^{\varepsilon\alpha} + \frac{e^\varepsilon-\frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]}}{e^\varepsilon-e^{-\varepsilon}} e^{-\alpha\varepsilon} \right] .\] Observation 1 tells us that this is valid, since \(z \in [e^{-\varepsilon},e^\varepsilon]\). 
Observation 2 and linearity of expectations give \[ \underset{Y \gets M(x’) }{\mathbb{E}}\left[ \frac{\frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]}-e^{-\varepsilon}}{e^\varepsilon-e^{-\varepsilon}} \cdot e^{\varepsilon\alpha} + \frac{e^\varepsilon-\frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]}}{e^\varepsilon-e^{-\varepsilon}} e^{-\alpha\varepsilon} \right] = \frac{1-e^{-\varepsilon}}{e^\varepsilon-e^{-\varepsilon}} \cdot e^{\varepsilon\alpha} + \frac{e^\varepsilon-1}{e^\varepsilon-e^{-\varepsilon}} e^{-\alpha\varepsilon}.\] We have \(\frac{1-e^{-\varepsilon}}{e^\varepsilon-e^{-\varepsilon}} = \frac{e^\varepsilon-1}{e^{2\varepsilon}-1} = \frac{e^\varepsilon-1}{(e^\varepsilon-1)(e^\varepsilon+1)} = \frac{1}{e^\varepsilon+1}\) and, similarly, \(\frac{e^\varepsilon-1}{e^\varepsilon-e^{-\varepsilon}} = \frac{e^\varepsilon}{e^\varepsilon+1}\). Combining the equalities and inequalities gives \[ \underset{Y \gets M(x’)}{\mathbb{E}}\left[ \left( \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \right)^\alpha \right] \le \frac{1}{e^\varepsilon+1} e^{\alpha\varepsilon} + \frac{e^\varepsilon}{e^\varepsilon+1} e^{-\alpha\varepsilon} = e^{(\alpha-1)\widehat\varepsilon(\alpha)},\] which establishes the result. The equivalence of the two expressions in the theorem statement is a matter of algebraic manipulation; the second expression is more suitable for numerical computation. ∎
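
For completeness, the algebraic manipulation relating the two expressions in the statement of Theorem 4 can be spelled out as follows; only the expressions from the theorem are used. Factoring out \(\frac{e^{\alpha\varepsilon}}{e^\varepsilon+1}\) gives \[ \frac{1}{e^\varepsilon+1} e^{\alpha \varepsilon} + \frac{e^\varepsilon}{e^\varepsilon+1} e^{-\alpha \varepsilon} = \frac{e^{\alpha\varepsilon}}{e^\varepsilon+1} \left( 1 + e^{-(2\alpha-1)\varepsilon} \right) = e^{(\alpha-1)\varepsilon} \cdot \frac{1 + e^{-(2\alpha-1)\varepsilon}}{1+e^{-\varepsilon}}. \] Taking logarithms and dividing by \(\alpha-1\) then yields \[ \widehat\varepsilon(\alpha) = \varepsilon - \frac{1}{\alpha-1} \log \left( \frac{1+e^{-\varepsilon}}{1 + e^{-(2\alpha-1)\varepsilon}} \right). \]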

Converting Pure DP to zCDP

The RDP bound in Theorem 4 is tight, but a bit unwieldy. Now we look at zCDP bounds, which are looser but simpler. The trivial bound gives that \(\varepsilon\)-DP implies \(\varepsilon\)-zCDP. In a previous post we proved that \(\varepsilon\)-DP implies \(\frac12\varepsilon^2\)-zCDP.⁴ Now we prove a tight bound:

Theorem 5 (Pure DP to zCDP): Let \(M : \mathcal{X}^n \to \mathcal{Y}\) be a randomized algorithm satisfying \(\varepsilon\)-differential privacy. Then \(M\) satisfies \(\rho\)-zCDP, where \[ \rho = \frac{e^\varepsilon-1}{e^\varepsilon+1} \varepsilon \le \frac12 \varepsilon^2. \] Furthermore, this bound is tight.

To prove this result, we use the following result, which is a tighter version of Hoeffding’s lemma.

Proposition 6 (Kearns-Saul inequality [KS13,BK13,AMN19]): For all \(p \in [0,1]\) and all \(t\in\mathbb{R}\), we have \[1-p + p \cdot e^t \le \exp\left(t \cdot p + t^2 \cdot \frac{1-2p}{4\log((1-p)/p)}\right).\]
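The inequality in Proposition 6 is easy to sanity-check numerically. The sketch below is not a proof, and the function name is made up for illustration. It returns the gap between the logarithms of the two sides, assuming \(0 < p < 1\) and using the limiting coefficient \(1/8\) at \(p = 1/2\) (which recovers Hoeffding's lemma).

```python
import math


def kearns_saul_gap(p: float, t: float) -> float:
    """Return log(RHS) - log(LHS) of the Kearns-Saul inequality; should be >= 0.

    Assumes 0 < p < 1. At p = 1/2 the coefficient (1-2p) / (4 log((1-p)/p))
    is a 0/0 form, so its limit 1/8 is used instead.
    """
    if abs(p - 0.5) < 1e-9:
        coeff = 0.125
    else:
        coeff = (1.0 - 2.0 * p) / (4.0 * math.log((1.0 - p) / p))
    log_lhs = math.log(1.0 - p + p * math.exp(t))
    log_rhs = t * p + t * t * coeff
    return log_rhs - log_lhs
```

Evaluating `kearns_saul_gap` on a grid of \(p\) and \(t\) values should give non-negative results up to rounding error, with a gap of zero at \(t = 0\).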

Proof of Theorem 5. By Theorem 4, \(M\) satisfies \((\alpha,\widehat\varepsilon(\alpha))\)-Rényi DP for all \(\alpha>1\), where \[ e^{(\alpha-1)\widehat\varepsilon(\alpha)} = \frac{1}{e^\varepsilon+1} e^{\alpha \varepsilon} + \frac{e^\varepsilon}{e^\varepsilon+1} e^{-\alpha \varepsilon} .\] We need to show \(\widehat\varepsilon(\alpha) \le \rho\alpha\) for all \(\alpha>1\). Fix \(\alpha>1\).

Let \(p = \tfrac{1}{e^\varepsilon+1}\). Then \[ \frac{1}{e^\varepsilon+1} e^{\alpha \varepsilon} + \frac{e^\varepsilon}{e^\varepsilon+1} e^{-\alpha \varepsilon} = e^{-\alpha\varepsilon} \cdot \left( 1-p + p e^{2\alpha\varepsilon} \right) .\] By the Kearns-Saul inequality, \[ e^{-\alpha\varepsilon} \cdot \left( 1-p + p e^{2\alpha\varepsilon} \right) \le \exp\left((2p-1)\alpha\varepsilon + ( 2 \alpha \varepsilon)^2 \cdot \frac{1-2p}{4\log((1-p)/p)}\right) .\] Since \(2p-1 = - \tfrac{e^\varepsilon-1}{e^\varepsilon + 1}\) and \( \frac{1-p}{p} = e^\varepsilon \), this simplifies to \[ \exp\left((2p-1)\alpha\varepsilon + ( 2 \alpha \varepsilon)^2 \cdot \frac{1-2p}{4\log((1-p)/p)}\right) = \exp\left( -\alpha\varepsilon\frac{e^\varepsilon-1}{e^\varepsilon+1} + 4 \alpha^2 \varepsilon^2 \frac{\frac{e^\varepsilon-1}{e^\varepsilon+1}}{4\varepsilon} \right)\]\[ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ = \exp\left( (\alpha-1) \alpha \varepsilon \frac{e^\varepsilon-1}{e^\varepsilon+1} \right). \] Combining the inequalities yields \( \widehat\varepsilon(\alpha) \le \alpha \varepsilon \frac{e^\varepsilon-1}{e^\varepsilon+1} \), which gives the result.

Tightness is witnessed by randomized response and by taking the limit \(\alpha \to 1\). ∎
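The statement of Theorem 5 can be checked numerically against the tight RDP curve of Theorem 4. The following is a minimal sketch with illustrative function names. It uses the identity \(\frac{e^\varepsilon-1}{e^\varepsilon+1} = \tanh(\varepsilon/2)\).

```python
import math


def zcdp_from_pure_dp(eps: float) -> float:
    """Tight zCDP parameter from Theorem 5: rho = eps * (e^eps - 1) / (e^eps + 1)."""
    return eps * math.tanh(eps / 2.0)


def rdp_from_pure_dp(eps: float, alpha: float) -> float:
    """Tight RDP bound from Theorem 4 (second, numerically stable expression)."""
    num = math.log1p(math.exp(-eps))
    den = math.log1p(math.exp(-(2.0 * alpha - 1.0) * eps))
    return eps - (num - den) / (alpha - 1.0)


if __name__ == "__main__":
    for eps in (0.5, 1.0, 2.0):
        rho = zcdp_from_pure_dp(eps)
        for alpha in (1.001, 2.0, 10.0):
            # The RDP curve should sit below the line alpha * rho.
            print(eps, alpha, rdp_from_pure_dp(eps, alpha), alpha * rho)
```

As \(\alpha \to 1\), the printed RDP value approaches \(\rho\), which is the tightness claim above.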

Numerical Comparison

Let’s see what these improved bounds look like:

This first plot compares the tight Rényi DP bound from Theorem 4 (solid line) with the trivial bound (\(\widehat\varepsilon(\alpha)\le\varepsilon\), dotted line) and the bound implied by zCDP (\(\widehat\varepsilon(\alpha)\le\alpha\rho\), dashed line) via Theorem 5. We consider \(\varepsilon=\frac12\) (red lines, bottom), \(\varepsilon=1\) (green lines, middle), and \(\varepsilon=2\) (blue lines, top).

We see that the trivial bound is tight as the Rényi order \(\alpha\) becomes large, while the zCDP bound is tight for small Rényi orders (i.e., \(\alpha\to1\)). The smaller \(\varepsilon\) is, the later this transition occurs.
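One way to locate this transition is to ask where the two upper bounds cross: the zCDP-implied bound \(\alpha\rho\) equals the trivial bound \(\varepsilon\) at \(\alpha = \varepsilon/\rho = \frac{e^\varepsilon+1}{e^\varepsilon-1}\). A small sketch (the function name is illustrative):

```python
import math


def crossover_order(eps: float) -> float:
    """Renyi order alpha at which alpha * rho equals eps, with rho from Theorem 5."""
    # eps / (eps * tanh(eps / 2)) = 1 / tanh(eps / 2) = (e^eps + 1) / (e^eps - 1)
    return 1.0 / math.tanh(eps / 2.0)


if __name__ == "__main__":
    for eps in (0.5, 1.0, 2.0):
        print(eps, crossover_order(eps))
```

The crossover order grows as \(\varepsilon\) shrinks, matching the observation that the transition occurs later for smaller \(\varepsilon\).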

This second plot compares the tight zCDP bound from Theorem 5 (solid magenta line) against the trivial bound (dotted yellow line) and the quadratic bound (dashed cyan line).

We see that, for small values of \(\varepsilon\), the quadratic bound is tight, while for large values of \(\varepsilon\), the trivial bound is tight.
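
Both regimes can be read off from the closed form in Theorem 5. Using the identity \(\frac{e^\varepsilon-1}{e^\varepsilon+1} = \tanh(\varepsilon/2)\) and the standard series for \(\tanh\), we have, for small \(\varepsilon\), \[ \rho = \varepsilon \tanh(\varepsilon/2) = \frac12 \varepsilon^2 - \frac{1}{24} \varepsilon^4 + O(\varepsilon^6), \] while, for all \(\varepsilon\), \[ \rho = \varepsilon - \frac{2\varepsilon}{e^\varepsilon+1}, \] so the gap to the trivial bound vanishes exponentially fast as \(\varepsilon\) grows.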

Conclusion

In this post, we have given improved bounds for converting from pure DP to Rényi DP and zCDP. Numerically, these bounds are a modest improvement over the standard bounds.

The bounds are tight when the algorithm corresponds to randomized response. However, in many cases we can prove better bounds for specific algorithms. For example, in a previous post, we proved better zCDP bounds for the exponential mechanism.

Another popular pure DP mechanism is Laplace noise addition. Mironov [M17, Proposition 6] computed a tight Rényi DP bound specifically for the Laplace mechanism: Adding Laplace noise with scale \(1/\varepsilon\) to a sensitivity-1 function guarantees \(\varepsilon\)-DP and also \((\alpha,\widehat\varepsilon_{\text{Lap}}(\alpha))\)-RDP for all \(\alpha>1\) and \[\widehat\varepsilon_{\text{Lap}}(\alpha) = \frac{1}{\alpha-1}\log\left( \frac{\alpha}{2\alpha-1} e^{(\alpha-1)\varepsilon} + \frac{\alpha-1}{2\alpha-1} e^{-\alpha\varepsilon} \right).\]
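To see how much the Laplace-specific bound gains over the generic conversion, the two curves can be compared numerically. This is a minimal sketch with illustrative function names; it evaluates the formula above directly, so it assumes \(\alpha\varepsilon\) is small enough that the exponentials do not overflow.

```python
import math


def rdp_laplace(eps: float, alpha: float) -> float:
    """RDP of Laplace noise with scale 1/eps on a sensitivity-1 function (formula above)."""
    a = alpha / (2.0 * alpha - 1.0) * math.exp((alpha - 1.0) * eps)
    b = (alpha - 1.0) / (2.0 * alpha - 1.0) * math.exp(-alpha * eps)
    return math.log(a + b) / (alpha - 1.0)


def rdp_generic(eps: float, alpha: float) -> float:
    """Generic tight bound for any eps-DP mechanism (Theorem 4, stable form)."""
    num = math.log1p(math.exp(-eps))
    den = math.log1p(math.exp(-(2.0 * alpha - 1.0) * eps))
    return eps - (num - den) / (alpha - 1.0)


if __name__ == "__main__":
    for alpha in (1.5, 2.0, 5.0, 20.0):
        print(alpha, rdp_laplace(1.0, alpha), rdp_generic(1.0, alpha))
```

Since the Laplace mechanism is itself \(\varepsilon\)-DP, its curve should never exceed the generic one.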

Acknowledgements

Thanks to Damien Desfontaines for prompting this post. To the best of my knowledge this improved conversion first appeared in a Tweet by Yu-Xiang Wang.

¹ In general, we can replace \(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]}\) with the Radon-Nikodym derivative of the probability distribution given by \(M(x)\) with respect to the probability distribution given by \(M(x’)\) evaluated at \(y\). If the output distributions do not have full support, we must handle division by zero; to do this we take \(\frac{0}{0} = 1\) and \(\frac{\eta}{0} = \infty\) for \(\eta>0\).
² To be more precise, we have \[\underset{Y \gets M(x’)}{\mathbb{E}}\left[ \left( \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \right)^\alpha \right] \le \underset{Y \gets M(x’)}{\mathbb{E}}\left[ \frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \right] \cdot \max_y \left( \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} \right)^{\alpha-1} \le 1 \cdot \left( e^\varepsilon \right)^{\alpha-1},\] which yields the trivial conversion. Here we use Observation 2 from the proof of Theorem 4.
³ This proof technique is due to Bun & Steinke [BS16, Proposition 3.3].
⁴ Bun & Steinke [BS16, Proposition 3.3] first established this bound, although with a more involved proof. Earlier papers [DRV10,DR16] proved slightly weaker bounds.

NeurIPS 2023 Outstanding Paper: Privacy auditing in just one run

Tue, 02 Jan 2024 12:00:00 -0400

NeurIPS 2023 just wrapped up, and one of the two outstanding paper awards went to Privacy Auditing with One (1) Training Run, by Thomas Steinke, Milad Nasr, and Matthew Jagielski. The main result of this paper is a method for auditing the (differential) privacy guarantees of an algorithm that is much faster and more practical than previous methods. In this post, we’ll dive into what this all means.

In case you’re new to this: by now, it has been well established that ML models can leak information about their training data. This has recently been demonstrated in a spectacular fashion for large language models and diffusion models, showing that these models are prone to regurgitating elements from their training dataset verbatim. Beyond these models, training data leakage can occur to a variety of degrees in other statistical settings. This can of course be problematic if the training data contains sensitive personal information that we do not wish to disclose. It may also be relevant to other adjacent considerations, including copyright infringement, which we don’t delve into here.

While there have been a number of heuristic proposals for how to deal with such problems, only one method has stood the test of time: differential privacy (DP). Roughly speaking, an algorithm (e.g., a model’s training procedure) is differentially private if its output has limited dependence (in some precise sense) on any single datapoint. This has many convenient implications: if a training procedure is differentially private, the resulting model is very unlikely to spit out training data, it is hard to predict whether a particular datapoint was in its training dataset, etc. This strong notion of privacy has been adopted by a number of organizations, including Google, Microsoft, and the US Census Bureau in the 2020 US Census. Differential privacy is a quantitative guarantee, parameterized by a value \(\varepsilon \geq 0\): the smaller \(\varepsilon\) is, the stronger the privacy protection (albeit at the cost of utility).

In order to say an algorithm is differentially private, we have to prove it. By analyzing the algorithm, we obtain an upper bound on the value of \(\varepsilon\), i.e., a guarantee that the algorithm satisfies at least some prescribed level of privacy. And we can be confident in this guarantee without running a single line of code! A rich line of work studies a differentially private analogue of stochastic gradient descent (which includes per-example gradient clipping followed by Gaussian noise addition), providing tighter and tighter upper bounds on the value of \(\varepsilon\).

Is there any way to empirically audit the privacy of an algorithm? Given a purportedly private procedure, is there an algorithm we can run to lower bound the value of \(\varepsilon\)? This would show that the procedure enjoys privacy no better than some particular level. There are many reasons one might want to audit an algorithm’s privacy guarantees:

We can see if our privacy proof is tight: if we prove and audit matching values of \(\varepsilon\), then we know that neither can be improved.
We can see if our privacy proof is wrong: if we audit a value of \(\varepsilon\) that is greater than the value we prove, then we know there was a bug in our privacy proof.
If we’re unable to rigorously prove an algorithm is private, auditing gives some heuristic measure of how private the algorithm is (though this is not considered best practice in settings where privacy is paramount: auditing only lower bounds \(\varepsilon\), so the true value may be much higher).

There is a long line of work on this question from the perspective of membership inference attacks. In a membership inference attack, we consider training a model on either a) some training dataset, or b) the same training dataset but with the inclusion of one extra datapoint (sometimes called a canary). If we can correctly guess whether the canary was or was not in the training set, then we say the membership inference attack was successful. However, recall that differential privacy limits the dependence on individual datapoints: if an algorithm is private, it means that membership inference attacks should not be very successful. Conversely, if an attack is very successful, then it says the algorithm is quantitatively not so private. In other words, such membership inference attacks serve as an audit of the privacy of the algorithm.

An important technical point is that differential privacy is a probabilistic guarantee. A single membership inference attack success or failure may happen by chance: in order to make conclusions about the privacy level of a procedure, we need to run the attack several times in order to estimate the rate of success. Since for machine learning models, each attack corresponds to one training run, this can quickly result in prohibitive overheads. As one extreme example, one work trains 250,000 models to audit a proposed private training algorithm, revealing a bug in its privacy proof. While these are small models (CNNs trained on MNIST), and the authors admit their auditing was overkill (they only needed to train 1,000 models), in modern settings, even a single extra training run is prohibitively expensive, thus rendering such privacy auditing methods impractical.
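To make the statistical step concrete, here is a minimal sketch of how a count of successful attacks can be turned into a high-confidence lower bound on \(\varepsilon\). It is not the procedure of any particular paper. It assumes pure \(\varepsilon\)-DP, independent runs, and a canary that is included with probability 1/2 in each run, in which case the attack's accuracy is at most \(\frac{e^\varepsilon}{e^\varepsilon+1}\). It uses Hoeffding's inequality for the confidence bound. The function name and the default confidence level are illustrative.

```python
import math


def audit_epsilon_lower_bound(correct: int, trials: int, beta: float = 0.05) -> float:
    """Lower bound on eps that holds with probability >= 1 - beta over the attack runs.

    Assumptions (not from the post): pure eps-DP, independent runs, and the
    canary is included with probability 1/2 in each run.
    """
    if trials <= 0 or not 0 <= correct <= trials:
        raise ValueError("need 0 <= correct <= trials and trials > 0")
    # One-sided Hoeffding bound: true accuracy >= empirical accuracy - slack.
    slack = math.sqrt(math.log(1.0 / beta) / (2.0 * trials))
    acc_lower = correct / trials - slack
    if acc_lower <= 0.5:
        return 0.0  # accuracy consistent with random guessing: no evidence against eps = 0
    # eps-DP implies accuracy <= e^eps / (e^eps + 1), i.e. eps >= log(acc / (1 - acc)).
    return math.log(acc_lower / (1.0 - acc_lower))
```

The slack term shrinks only like \(1/\sqrt{\text{trials}}\), which is why a meaningful audit along these lines needs many attack runs.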

Here’s where the work of Steinke, Nasr, and Jagielski comes in: it performs privacy auditing with just one (1) training run. This could even be the same as your actual training run, thus incurring minimal overhead with respect to the standard training pipeline. Their method does this by randomly inserting multiple canaries into the dataset rather than just a single one, and privacy is audited by trying to guess which canaries were and were not trained on. If one can correctly guess the status of many canaries, this implies that the procedure is not very private. The analysis of this framework is the tricky part, and gets quite technical. While textbook analysis of the addition/removal of multiple canaries would rely on a property of differential privacy known as “group privacy,” this turns out to be lossy. Instead, the authors appeal to connections between differential privacy and generalization: they show that if you add multiple canaries i.i.d. for a single run, this behaves similarly to having multiple runs each with a single canary.

In short, this work is a breakthrough in privacy auditing. It allows us to substantially reduce the computational overhead, from prohibitive to essentially negligible. Up to this point, privacy auditing has mostly been employed by those with a surplus of compute: I’m excited to see how this work will make it more accessible to the GPU-poor. Congratulations to Thomas, Milad, and Matthew on their fantastic result!

Open problem(s) - How generic can composition results be?

Mon, 18 Sep 2023 21:00:00 -0400

The composition theorem is a cornerstone of the differential privacy literature. In its most basic formulation, it states that if two mechanisms \(\mathcal{M}_1\) and \(\mathcal{M}_2\) are respectively \(\varepsilon_1\)-DP and \(\varepsilon_2\)-DP, then the mechanism \(\mathcal{M}\) defined by \(\mathcal{M}(D)=\left(\mathcal{M}_1(D),\mathcal{M}_2(D)\right)\) is \((\varepsilon_1+\varepsilon_2)\)-DP. A large body of work has focused on proving extensions of this composition theorem. These extensions are of two kinds.

Some composition results apply to different settings than fixed mechanisms.
Others extend known results to variants of differential privacy.

In this blog post, we review existing results, and outline natural open questions appearing on both fronts. We stumbled upon these open questions while building general-purpose differential privacy infrastructure, and we believe that solving them could have a positive impact on the usability and privacy/accuracy trade-offs provided by such tools.

Different settings for composition

First, let’s discuss what it means to compose two DP mechanisms.

Sequential composition

In the original composition result [DMNS06], all mechanisms \(\mathcal{M}_1\), \(\mathcal{M}_2\), etc., are fixed in advance, and have a predetermined privacy budget (resp. \(\varepsilon_1\), \(\varepsilon_2\), etc.). They only take the sensitive data \(D\) as input: \(\mathcal{M}_2\) cannot see nor depend on \(\mathcal{M}_1(D)\). This setting is typically called sequential composition.
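For intuition, the basic composition bound in this setting can be checked exactly on a toy example. The sketch below is an illustration rather than anything from the cited papers. It takes \(\mathcal{M}_1\) and \(\mathcal{M}_2\) to be independent randomized response mechanisms on a single bit and computes the worst-case privacy loss of the pair of outputs.

```python
import itertools
import math


def rr_dist(bit: int, eps: float) -> dict:
    """Output distribution of eps-DP randomized response applied to one bit."""
    keep = math.exp(eps) / (math.exp(eps) + 1.0)
    return {bit: keep, 1 - bit: 1.0 - keep}


def max_privacy_loss(eps1: float, eps2: float) -> float:
    """Worst-case |log-likelihood ratio| of (M1(D), M2(D)) between D = 0 and D = 1."""
    worst = 0.0
    for y1, y2 in itertools.product((0, 1), repeat=2):
        p = rr_dist(0, eps1)[y1] * rr_dist(0, eps2)[y2]
        q = rr_dist(1, eps1)[y1] * rr_dist(1, eps2)[y2]
        worst = max(worst, abs(math.log(p / q)))
    return worst
```

For this pair of mechanisms the worst-case loss equals \(\varepsilon_1+\varepsilon_2\), so the basic composition bound cannot be improved in general for pure DP.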

Adaptive composition

Shortly afterwards, the result was extended to a setting called adaptive composition [DKMMN06]. In this context, each mechanism can access the outputs of previous mechanisms: for example, \(\mathcal{M}_2\) takes as input not only the sensitive data \(D\), but also \(\mathcal{M}_1(D)\). However, the privacy budget associated with each mechanism is still fixed.

Fully adaptive composition

A natural extension of adaptive composition consists in allowing the privacy budget of each mechanism to depend on previous outputs. This setting is called fully adaptive composition [RRUV16]. It captures a setting in which a single analyst is interacting with a DP interface, and can change which queries to run and their budget based on past results.

Composition theorems in the fully adaptive setting are of two types.

Privacy filters assume that the DP interface has a fixed, total budget, and will refuse to answer queries once that budget is exhausted.
Privacy odometers, by contrast, allow the analyst to run arbitrarily many queries using as much budget as they want, and quantify the privacy loss over time.

Somewhat surprisingly, there are separation results between both types: one can obtain tighter composition theorems with privacy filters than privacy odometers.
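The difference between the two interfaces can be sketched in a few lines. The toy classes below are hypothetical and only handle pure DP with simple additive accounting; they say nothing about the tighter bounds where the separation results apply.

```python
class PureDPFilter:
    """Toy privacy filter: refuses a query if it would exceed the fixed total budget."""

    def __init__(self, total_eps: float):
        self.total_eps = total_eps
        self.spent = 0.0

    def try_spend(self, eps: float) -> bool:
        """Return True and record the cost if the query fits in the budget."""
        if eps < 0:
            raise ValueError("eps must be non-negative")
        if self.spent + eps > self.total_eps + 1e-12:
            return False  # budget exhausted: the query is not answered
        self.spent += eps
        return True


class PureDPOdometer:
    """Toy privacy odometer: never refuses, only reports the running privacy loss."""

    def __init__(self):
        self.spent = 0.0

    def spend(self, eps: float) -> float:
        """Record the cost and return the updated running bound."""
        self.spent += eps
        return self.spent
```

In both cases the value of `eps` passed to each call may be chosen adaptively, based on earlier answers; the filter enforces a bound fixed in advance, while the odometer's bound is only known after the fact.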

Concurrent composition

This is, however, not the end of the story. Fully adaptive composition captures a setting in which a single analyst interacts with a DP interface. What if multiple analysts have access to this interface, each with their own budget? Concurrent composition [VW21] captures this idea. In this setting, the mechanisms that are being composed are interactive (we denote them by IM in the diagram below), and the analysts interacting with each mechanism can share results with each other, and adaptively decide which queries to run. The goal is to quantify the total privacy budget cost, across analysts: do existing results extend to the composition of interactive mechanisms?

Fully concurrent composition?

In concurrent composition as defined in [VW21], the number of analysts and their respective privacy budget is fixed upfront. This means that concurrent composition and fully adaptive composition results are incomparable. This suggests an even more generic setting, which (to the best of our knowledge) has not been studied in the literature: a kind of concurrent composition, where the number of analysts and their budget is not predefined. Let’s call this fully concurrent composition. In this setting, an analyst with a certain privacy budget would be able to spin off a new interactive mechanism, with an adaptively-chosen privacy budget, that can also be interacted with concurrently.

This setting might seem pointless (why would analysts want to do this?), but proving composition results in this context would help build DP interfaces that combine expressivity and conceptual simplicity. To understand why, let’s take a look at how Tumult Analytics¹ allows users to use its parallel composition feature.

Tumult Analytics has a concept of a Session, which is initialized on some sensitive data with a given privacy budget. Users can submit queries to this Session using a query language implemented in Python. Each query executed by the Session will consume part of the overall privacy budget, and return DP results. The user can then examine these results to decide which queries to submit to the Session next, and with which privacy budget. So far, this matches the fully adaptive setting, in its privacy filter formulation.

But Tumult Analytics also allows users to split their sensitive data depending on the value of an attribute, and perform different operations in each partition of the data. With this feature, users can write algorithms that use parallel composition, which is very useful. This partitioning operation takes a fraction of the privacy budget, and spins off sub-Sessions that each have access to a subset of the original data. The following diagram visualizes an example of this process.

At the beginning, there is one Session with a privacy budget of \(\varepsilon_1=3\). After the partitioning operation, there are now three Sessions: the original Session that has access to all the data and has a leftover privacy budget of \(\varepsilon_1=2\), and two sub-Sessions that each have access to a partition of the data and have a privacy budget of \(\varepsilon_2=1\). The analyst using this interface can interact with any of these three Sessions, and interleave queries between each, in a fully interactive manner. This means that even though there is a single user interacting with the data, the setting is similar to concurrent composition: each Session is an interactive object with a maximum privacy budget. However, note that the privacy budget associated with each of the sub-Sessions could, in principle, depend on the result of past queries. This suggests that we need composition results that take this into account, and capture the fully concurrent setting suggested above.
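The budget bookkeeping in this example can be mimicked with a toy model. The `ToySession` class below is a hypothetical stand-in written for this illustration; it is not the Tumult Analytics API and it holds no data. It only tracks budgets, charging the parent once for a partition and giving each sub-Session that same amount, as parallel composition allows for disjoint partitions.

```python
class ToySession:
    """Hypothetical budget tracker; NOT the Tumult Analytics API."""

    def __init__(self, budget: float):
        self.remaining = budget

    def spend(self, eps: float) -> None:
        """Deduct eps from this Session's budget, or fail if it is exhausted."""
        if eps > self.remaining + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= eps

    def partition(self, eps: float, num_parts: int) -> list:
        """Pay eps once; each disjoint partition gets its own budget of eps."""
        self.spend(eps)
        return [ToySession(eps) for _ in range(num_parts)]


# Reproduce the example: a Session with budget 3 spins off two sub-Sessions.
parent = ToySession(3.0)
children = parent.partition(1.0, 2)
```

Nothing in this toy model stops the `eps` passed to `partition` from depending on earlier query results, which is precisely the adaptivity that a fully concurrent composition result would need to cover.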

Composition for variants of differential privacy

Existing results and natural questions

A large number of variants and extensions of differential privacy have been proposed in the literature. In many cases, a benefit of these alternative definitions is to improve the privacy analysis of mechanisms that compose a large number of simpler primitives. For example, the \(n\)-fold composition of \(\varepsilon\)-DP mechanisms is \(n\varepsilon\)-DP, but the \(n\)-fold composition of \((\varepsilon,\delta)\)-DP mechanisms is also \((\varepsilon’,\delta’)\)-DP, with \(\varepsilon’\approx\sqrt{n}\varepsilon\) and \(\delta’\approx n\delta\). Machine learning applications often use the moments accountant to perform privacy accounting, relying on the composition property of Rényi DP [Mir17, ACGMMTZ16]. Gaussian DP and its generalization \(f\)-DP [DRS22] are also used in this context [BDLS20]. Meanwhile, statistical use cases using the Gaussian mechanism often use zero-concentrated DP [BS16] (zCDP) for their privacy analysis [Des21]; the approximate version of this definition is also useful when queries are grouped by an unknown domain [SDH23].
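To put numbers on the \(\sqrt{n}\) scaling, the sketch below compares basic composition with one standard form of the advanced composition theorem of Dwork, Rothblum, and Vadhan: the \(n\)-fold composition of \((\varepsilon,\delta)\)-DP mechanisms is \((\varepsilon’, n\delta+\delta_{\text{slack}})\)-DP with \(\varepsilon’ = \varepsilon\sqrt{2n\log(1/\delta_{\text{slack}})} + n\varepsilon(e^\varepsilon-1)\). The choice of this particular statement, the function names, and the parameter name `delta_slack` are assumptions made for illustration; tighter bounds exist.

```python
import math


def basic_composition(eps: float, n: int) -> float:
    """Total epsilon under basic composition of n eps-DP mechanisms."""
    return n * eps


def advanced_composition(eps: float, delta: float, n: int, delta_slack: float):
    """Return (eps_total, delta_total) under advanced composition (assumed form)."""
    eps_total = (
        eps * math.sqrt(2.0 * n * math.log(1.0 / delta_slack))
        + n * eps * math.expm1(eps)  # expm1(eps) = e^eps - 1, accurate for small eps
    )
    return eps_total, n * delta + delta_slack
```

For small \(\varepsilon\) and large \(n\), the first term dominates and grows like \(\sqrt{n}\,\varepsilon\), compared with \(n\varepsilon\) for basic composition.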

It is thus natural to study the composition of these variants under the settings described in the previous section. For many variants and composition settings, optimal composition results have been proven. We give an overview in the following table.

	Sequential	Adaptive	Fully adaptive	Concurrent
\(\varepsilon\)-DP	[DMNS06]	[DKMMN06]	[RRUV16]	[VW21]
\((\varepsilon,\delta)\)-DP	[KOV15]	[KOV15]	[WRRW22]*	[WRRW22, Lyu22]
Gaussian DP	[DRS22]	[DRS22]	[ST22]	[VZ22]
\(f\)-DP	[DRS22]	[DRS22]		[VZ22]
\((\alpha,\varepsilon)\)-Rényi DP	[Mir17]	[Mir17]	[FZ21]	[Lyu22]
\(\rho\)-zero-concentrated DP	[BS16]	[BS16]	[FZ21]	[Lyu22]
\(\delta\)-approx. \(\rho\)-zCDP	[BS16]	[BS16]	[WRRW22]

* Only asymptotically optimal for small ε.

This summary already suggests a few natural open questions: it is not known whether the fully adaptive composition results for \((\varepsilon,\delta)\)-DP can be improved, and there is neither a fully adaptive composition theorem for \(f\)-DP nor a concurrent composition theorem for \(\delta\)-approximate \(\rho\)-zCDP.

Reordering mechanisms during the privacy analysis

Let’s assume for a moment that the table above is completed, and that we have optimal composition theorems for all the variants of interest and all settings. Consider an analyst using a differential privacy framework, and performing multiple operations in a fully adaptive way. Some of these operations satisfy \(\rho\)-zCDP and others \((\varepsilon,\delta)\)-DP, in alternation and with varying parameters. How should the privacy accounting be done in such a scenario?

In the context of sequential composition, it would be natural to reorder those mechanisms: consider the equivalent situation where all \(\rho\)-zCDP mechanisms occur first, and all \((\varepsilon,\delta)\)-DP mechanisms occur afterwards. In this setting, the zCDP mechanisms can first be composed using the zCDP composition rule. The overall zCDP guarantee can then be converted to \((\varepsilon,\delta)\)-DP, and composed with the other \((\varepsilon,\delta)\)-DP guarantees. This will lead to a tighter privacy analysis than converting every individual \(\rho\)-zCDP mechanism to \((\varepsilon,\delta)\)-DP, and composing those guarantees.
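The gain from composing before converting can be illustrated numerically. The sketch below uses the conversion from [BS16] that \(\rho\)-zCDP implies \((\rho + 2\sqrt{\rho\log(1/\delta)},\delta)\)-DP. As simplifying assumptions, the second pipeline splits \(\delta\) evenly across mechanisms and uses basic composition for the resulting \((\varepsilon,\delta)\)-DP guarantees. Function names are illustrative.

```python
import math


def zcdp_to_approx_dp(rho: float, delta: float) -> float:
    """Epsilon such that rho-zCDP implies (epsilon, delta)-DP, via [BS16]."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def compose_then_convert(rhos: list, delta: float) -> float:
    """Add up the zCDP parameters first, then convert once."""
    return zcdp_to_approx_dp(sum(rhos), delta)


def convert_then_compose(rhos: list, delta: float) -> float:
    """Convert each mechanism with delta / k, then add the epsilons (basic composition)."""
    per_mechanism_delta = delta / len(rhos)
    return sum(zcdp_to_approx_dp(rho, per_mechanism_delta) for rho in rhos)
```

With 100 mechanisms of \(\rho = 0.01\) each and \(\delta = 10^{-6}\), the first pipeline gives a much smaller \(\varepsilon\) than the second; a sharper composition rule in the second pipeline would narrow the gap but not remove the motivation for reordering.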

However, we would need an additional theoretical result to perform this kind of reordering operation in a fully adaptive context: the fact that composition results exist for \((\varepsilon,\delta)\)-DP and \(\rho\)-zCDP does not mean they can be combined. How to resolve this problem, and make it possible to use the same privacy accounting techniques in the sequential setting and in the fully adaptive or fully concurrent setting? This leads to a natural open question: when performing the privacy analysis of a privacy filter, can one “reorder” the mechanisms when composing them? Answering this positively would allow DP frameworks to implement tighter privacy accounting at a relatively low cost in complexity. It might very well be that the answer to this open question is negative. In that case, proving such a separation result would be of significant theoretical interest in the study of DP composition.

Composing privacy loss distributions

When we say that a mechanism is \((\varepsilon,\delta)\)-DP, or \(\rho\)-zCDP, we are giving a “global” bound on the privacy loss random variable, defined by: \[ \mathcal{L}_{D,D’}(o) = \ln\left(\frac{\mathbb{P}\left[\mathcal{M}(D)=o\right]}{\mathbb{P}\left[\mathcal{M}(D’)=o\right]}\right) \] for all neighboring inputs \(D\) and \(D’\).

An alternative approach to privacy accounting consists in fully describing this random variable. One approach to do this uses the formalism of privacy loss distributions (PLDs) [SMM18]. The PLD of a mechanism is defined as: \[ \omega(y) = \mathbb{P}_{o\sim\mathcal{M}(D)}\left[\mathcal{L}_{D,D’}(o)=y\right]. \]

In the sequential composition setting, PLDs can be used for tight privacy analysis. This relies on a conceptually simple result: if \(\omega\) is the PLD of \(\mathcal{M}\) and \(\omega’\) is the PLD of \(\mathcal{M}’\) on neighboring databases \(D\), \(D’\), then the PLD of the composition of \(\mathcal{M}\) and \(\mathcal{M}’\) is \(\omega\ast\omega’\), where \(\ast\) is the convolution operator. Of course, when doing privacy accounting, we don’t want \(\omega\) and \(\omega’\) to depend on the pair of databases, so we replace them by worst-case PLDs, that are “larger” than all possible PLDs for neighboring databases.
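The convolution rule is easy to see on a discrete example. The sketch below represents a PLD as a dictionary from privacy loss values to probabilities, uses binary randomized response as the mechanism (an example chosen here for illustration), and composes two copies. It only tracks one direction of the pair \((D,D’)\), and it reads off \(\delta\) using the standard identity \(\delta(\varepsilon) = \mathbb{E}_{y\sim\omega}\left[\max\{0, 1-e^{\varepsilon-y}\}\right]\); all function names are hypothetical.

```python
import math
from collections import defaultdict

def randomized_response_pld(eps):
    """PLD of binary randomized response: the loss is +eps or -eps."""
    p = math.exp(eps) / (1.0 + math.exp(eps))
    return {eps: p, -eps: 1.0 - p}

def compose(pld_a, pld_b):
    """Convolve two discrete PLDs: losses add up, probabilities multiply."""
    out = defaultdict(float)
    for loss_a, prob_a in pld_a.items():
        for loss_b, prob_b in pld_b.items():
            # Rounding merges keys that differ only by floating-point error.
            out[round(loss_a + loss_b, 12)] += prob_a * prob_b
    return dict(out)

def delta_for_epsilon(pld, eps):
    """delta(eps) = E[max(0, 1 - exp(eps - loss))] for this direction of the pair."""
    return sum(prob * max(0.0, 1.0 - math.exp(eps - loss))
               for loss, prob in pld.items())

pld = randomized_response_pld(1.0)
pld2 = compose(pld, pld)   # PLD of running the mechanism twice
print(sorted(pld2.items()))
print(delta_for_epsilon(pld2, 1.0))
```

Numerical accountants follow the same idea, but discretize continuous losses onto a grid and typically use fast Fourier transforms for the convolution.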

Using PLDs for privacy accounting can be done numerically [MM18, KJH20, KJPH21, GLW21, GKKM22, DGKKM22] or analytically [ZDW22]. This family of approaches is convenient because it is very generic: DP frameworks can use a tight upper bound PLD when known, and fall back to a worst-case PLD corresponding to \(\varepsilon\)-DP or \((\varepsilon,\delta)\)-DP when the mechanism is too complex. Unfortunately, the composition result mentioned above has only been proven in the sequential composition setting [MM18]. Extending it to adaptive composition is straightforward, but extending it to the fully adaptive setting (with privacy filters) or the concurrent setting does not seem trivial.

This leads us to our last open question: can these privacy accounting techniques be used in the fully adaptive or concurrent settings?

Summary

In this blog post, we gave a high-level overview of different settings and variants of composition theorems. Along the way, we listed a number of natural open questions.

Can we define a setting that generalizes both fully adaptive composition and concurrent composition? What composition results hold in that setting?
Can we “fill in the blanks” among existing composition results? Namely, can we prove optimal composition results for \((\varepsilon,\delta)\)-DP and \(f\)-DP in the fully adaptive setting, and for \((\varepsilon,\delta)\)-approximate zCDP in the concurrent setting?
In the fully adaptive setting with privacy filters, can one reorder mechanisms when computing their cumulative privacy loss, to optimize the privacy accounting?
Can we prove fully adaptive and concurrent composition results for privacy accounting based on privacy loss distributions?

Progress on these open questions would either uncover surprising additional separation results, or enable usability and utility improvements to general-purpose DP infrastructure. We’re excited about both prospects!

Tumult Analytics is a differential privacy framework used by institutions such as the U.S. Census Bureau, the IRS, or the Wikimedia Foundation. It is developed by Tumult Labs, the employer of the author of this blog post. ↩

Beyond Local Sensitivity via Down Sensitivity

Tue, 12 Sep 2023 10:00:00 -0700

In our previous post, we discussed local sensitivity and how we can get accuracy guarantees that scale with local sensitivity, which can be much better than the global sensitivity guarantees attained via standard noise addition mechanisms. In this post, we will look at what we can do when even the local sensitivity is unbounded. This is obviously a challenging setting, but it turns out that not all hope is lost.

As a motivating example, suppose we have a dataset \(x=(x_1,x_2,\cdots,x_n)\) and we want to approximate \(\max_i x_i \) in a differentially private manner. The difficulty is that adding a single element to \(x\) can increase the maximum arbitrarily. That is, if \(x’=(x_1,x_2,\cdots,x_n,\infty)\), then \(\max_i x’_i=\infty\). Differential privacy requires us to make the outputs \(M(x)\) and \(M(x’)\) indistinguishable, which seems to directly contradict our accuracy goal \(M(x) \approx \max_i x_i\).

One solution to the problem of unbounded sensitivity is to clip the inputs, so that the sensitivity becomes bounded. But this requires knowing a good a priori approximate upper bound on the \(x_i\)s. Trying to find such an upper bound is probably the very reason we want to approximate the maximum in the first place!

Another solution is to “aim lower:” Instead of aiming to approximate the largest element \(x_{(n)} := \max_i x_i\), we can aim to approximate the \(k\)-th largest element \(x_{(n-k+1)}\). The \(k\)-th largest element has bounded local sensitivity, which means we can apply the inverse sensitivity mechanism or similar tools. And – spoiler alert – this is essentially what we will do. However, we will present an algorithm that is more general than just for approximating the maximum.

The algorithm we present is due to Fang, Dong, and Yi [FDY22]. In terms of applications, a natural setting where we may need to approximate functions of unbounded local sensitivity is when each person can contribute multiple items to the dataset. This setting is often referred to as “user-level differential privacy” or “user DP.”¹ For example, if we have a collection of web browsing histories, we may wish to estimate the total number of webpages visited; this has unbounded local sensitivity because a single person could visit an arbitrary number of webpages.

Down Sensitivity

Observe that, while adding one element to the input can increase the maximum arbitrarily, removing one element can only decrease it by the gap between the largest and second-largest elements \(x_{(n)}-x_{(n-1)}\). In other words, the maximum satisfies some kind of one-sided local sensitivity bound. This is the general property we will rely on.

We define the \(k\)-down sensitivity² of the function \(f : \mathcal{X}^* \to \mathbb{R}\) at the input \(x\in\mathcal{X}^*\) as \[\mathsf{DS}^k_f(x) := \sup_{x’ \subseteq x : \mathrm{dist}(x,x’) \le k} |f(x)-f(x’)|. \tag{1}\] Here \(\mathrm{dist} : \mathcal{X}^* \times \mathcal{X}^* \to \mathbb{R}\) is the size of the symmetric difference between the two input tuples/multisets \(\mathrm{dist}(x,x’) = |x \setminus x’| + | x’ \setminus x |\), which defines a metric. In other words, it measures how many people’s data must be added or removed to get from one dataset to the other. For comparison, the local sensitivity is \[\mathsf{LS}^k_f(x) := \sup_{x’\in\mathcal{X}^* : \mathrm{dist}(x,x’) \le k} |f(x)-f(x’)|. \tag{2}\] The difference between Equations 1 and 2 is simply that down sensitivity only considers removing elements from \(x\), while local sensitivity considers both addition and removal. Thus, the down sensitivity is at most the local sensitivity, which is, in turn, upper bounded by the global sensitivity: \(\mathsf{DS}^k_f(x) \le \mathsf{LS}^k_f(x) \le k \cdot \mathsf{GS}_f\).
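Equation 1 can be evaluated directly on small inputs. The following brute-force sketch enumerates every way of removing at most \(k\) elements, so it takes exponential time and is only meant to illustrate the definition. The helper names are hypothetical, and defining the maximum of the empty dataset as \(0\) is an assumption made here for non-negative data.

```python
from itertools import combinations

def down_sensitivity(f, x, k):
    """Brute-force k-down sensitivity (Equation 1): remove up to k elements of x."""
    fx = f(x)
    worst = 0
    for r in range(1, min(k, len(x)) + 1):
        # Enumerate the index sets that survive after removing r elements.
        for kept in combinations(range(len(x)), len(x) - r):
            sub = tuple(x[i] for i in kept)
            worst = max(worst, abs(fx - f(sub)))
    return worst

def maximum(x):
    # Assumption for illustration: the maximum of the empty dataset is 0.
    return max(x, default=0)

x = (1, 3, 4, 10)
for k in (1, 2, 3):
    print(k, down_sensitivity(maximum, x, k))
```

For the maximum, the result is simply the gap between the largest and the \((k+1)\)-th largest element, which stays finite even though adding one element could move the maximum arbitrarily far.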

Intuitively, what is nice about down sensitivity is that it only considers the actual data we have at hand. It doesn’t consider any hypothetical people’s data that could be added to the dataset. It is appealing to only have to deal with “real” data.

Our goal now is to estimate \(f(x)\) in a differentially private manner, where the accuracy guarantee scales with the down sensitivity.

Monotonicity Assumption

In order to do anything, we need some assumptions about the function \(f : \mathcal{X}^* \to \mathcal{Y}\) that we are trying to approximate. First we will assume that \(\mathcal{Y} \subseteq \mathbb{R}\) is finite and \(f\) is surjective.³ The main assumption is monotonicity: \[\forall x’ \subseteq x \in \mathcal{X}^* ~~~ f(x’) \le f(x). \tag{3}\] The maximum and many other example functions satisfy this assumption.

Intuitively, we need this assumption to ensure that the down sensitivity is well-behaved. Specifically, Lemma 1 below requires monotonicity.

As an example of what could happen if we don’t make this assumption, consider the function \(\mathrm{sum}(x) := \sum_i x_i\) and the pair of neighbouring inputs \(x=(1,1,\cdots,1)\in\mathcal{Y}^n,x’=(1,1,\cdots,1,-100n)\in\mathcal{Y}^{n+1}\). Then, for all \(1 \le k\le n\), we have \(\mathsf{DS}_{\mathrm{sum}}^k(x)=k\), but \(\mathsf{DS}_{\mathrm{sum}}^k(x’)=100n\).

Note that the sum is monotone if we restrict to non-negative inputs. In general, we can take any function \(g\) and convert it into a monotone function \(f\) by defining \(f(x) = \max\{ g(\check{x}) : \check{x} \subseteq x \}\). Depending on the context, this \(f\) may or may not be a good proxy for \(g\).

A Loss With Bounded Global Sensitivity

Given a monotone function \(f : \mathcal{X}^* \to \mathbb{R}\), we define a loss function \(\ell : \mathcal{X}^* \times \mathbb{R} \to \mathbb{Z}_{\ge 0}\) by \[\ell(x,y) := \min\{ \mathrm{dist}(x,\tilde{x}) : \tilde{x} \subseteq x, f(\tilde{x}) \le y \}. \tag{4}\] In other words, \(\ell(x,y)\) measures how many entries of \(x\) we need to remove to decrease the function value until \(f(x) \le y\). Yet another way to think of it is that \(\ell(x,y)\) is the distance from the point \(x\) to the set \(f^{-1}((-\infty,y]) \cap \{ \tilde{x} : \tilde{x} \subseteq x \} \).

Figure 1: Visualization of the loss \(\ell(x,y)\) corresponding to \(f(x)=\max_i x_i\) for a dataset representing the distribution \(\mathrm{Binomial}(5,1/2)\) i.e. the true maximum is \(5\) and the dataset is \(x=(0,\underbrace{1,1,1,1,1}_{5\times},\underbrace{2,2,\cdots,2}_{10\times},\underbrace{3,3,\cdots,3}_{10\times},\underbrace{4,4,4,4,4}_{5\times},5)\).
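For the maximum, the loss in Equation 4 has a simple closed form: removing every entry above \(y\) is both necessary and sufficient, so \(\ell(x,y)\) is the number of entries exceeding \(y\). Here is a minimal sketch on the dataset from Figure 1, assuming the convention that the maximum of the empty dataset is \(-\infty\), so that removing everything is always allowed. The name max_loss is illustrative.

```python
def max_loss(x, y):
    """Loss from Equation 4 when f is the maximum: count the entries above y."""
    return sum(1 for xi in x if xi > y)

# Dataset from Figure 1: counts follow the Binomial(5, 1/2) shape 1, 5, 10, 10, 5, 1.
x = [0] + [1] * 5 + [2] * 10 + [3] * 10 + [4] * 5 + [5]

for y in range(5, -1, -1):
    print(f"y = {y}: loss = {max_loss(x, y)}")
```

The loss is zero at the true maximum and grows as \(y\) decreases, at a rate governed by how many data points sit near the top of the dataset.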

The key property we need is that this loss has bounded sensitivity. We split the proof into Lemmas 1 and 2.

Lemma 1. Let \(f : \mathcal{X}^* \to \mathbb{R}\) satisfy the monotonicity property in Equation 3. Define \(\ell : \mathcal{X}^* \times \mathbb{R} \to \mathbb{Z}_{\ge 0}\) as in Equation 4.
Let \(x’ \subseteq x \in \mathcal{X}^*\). Then \(\ell(x’,y)\le\ell(x,y)\) for all \(y \in \mathbb{R}\).

Proof. Fix \(y \in \mathbb{R}\) and \(x’ \subseteq x \in \mathcal{X}^*\). Let \(x_\Delta = x \setminus x’ \subseteq x\), so that \(x’ = x \setminus x_\Delta \).

Let \(\widehat{x} \subseteq x\) satisfy \(f(\widehat{x})\le y\) and \(\mathrm{dist}(x,\widehat{x})=\ell(x,y)\). Define \(\widehat{x}’ = \widehat{x} \setminus x_\Delta\). This ensures \(\widehat{x}’ \subseteq x’\) and \[\mathrm{dist}(x’,\widehat{x}’) = \mathrm{dist}(x \setminus x_\Delta , \widehat{x} \setminus x_\Delta ) \le \mathrm{dist}(x,\widehat{x}).\]

By monotonicity, \(f(\widehat{x}’) \le f(\widehat{x}) \le y\). Thus \[\ell(x’,y) = \min\{ \mathrm{dist}(x’,\tilde{x}’) : \tilde{x}’ \subseteq x’, f(\tilde{x}’) \le y \}\]\[ \le \mathrm{dist}(x’,\widehat{x}’) \le \mathrm{dist}(x,\widehat{x}) = \ell(x,y).\] ∎

Lemma 2. Let \(f : \mathcal{X}^* \to \mathbb{R}\). Define \(\ell : \mathcal{X}^* \times \mathbb{R} \to \mathbb{Z}_{\ge 0}\) as in Equation 4.
Let \(x’ \subseteq x \in \mathcal{X}^*\). Then \(\ell(x,y)\le\ell(x’,y)+\mathrm{dist}(x,x’)\) for all \(y \in \mathbb{R}\).

Proof. Fix \(y \in \mathbb{R}\) and \(x’ \subseteq x \in \mathcal{X}^*\).

Let \(\widehat{x}’ \subseteq x’\) satisfy \(f(\widehat{x}’)\le y\) and \(\mathrm{dist}(x’,\widehat{x}’)=\ell(x’,y)\). Since \(\widehat{x}’ \subseteq x’ \subseteq x\), we have \[\ell(x,y) = \min\{ \mathrm{dist}(x,\tilde{x}) : \tilde{x} \subseteq x, f(\tilde{x}) \le y \} \le \mathrm{dist}(x,\widehat{x}’) \]\[ \le \mathrm{dist}(x,x’) + \mathrm{dist}(x’,\widehat{x}’) = \ell(x’,y)+\mathrm{dist}(x,x’),\] by the triangle inequality, as required. ∎

Note that we only needed the monotonicity assumption for Lemma 1. Combining the two lemmas gives \[ \forall x’ \subseteq x ~ \forall y ~~~~~ \ell(x’,y) \le \ell(x,y) \le \ell(x’,y) + \mathrm{dist}(x,x’).\] Overall we have the following guarantee.

Proposition 3. (Global Sensitivity of the Loss) Let \(f : \mathcal{X}^* \to \mathbb{R}\) satisfy the monotonicity property in Equation 3. Define \(\ell : \mathcal{X}^* \times \mathbb{R} \to \mathbb{Z}_{\ge 0}\) as in Equation 4.
Then, for all \(x, x’ \in \mathcal{X}^*\) and all \(y \in \mathbb{R}\), we have \[|\ell(x,y)-\ell(x’,y)| \le \mathrm{dist}(x,x’).\]

Proof. Fix \(x, x’ \in \mathcal{X}^*\) and \(y \in \mathbb{R}\). Let \(x’’ = x \cap x’\). Since \(x’’ \subset x’\) and \(f\) is assumed to be monotone, Lemma 1 gives \(\ell(x’’ ,y) \le \ell(x’,y)\). Also \(x’’ \subset x\), whence Lemma 2 gives \(\ell(x,y) \le \ell(x’’ , y) + \mathrm{dist}(x , x’’ )\). Note that \( \mathrm{dist}(x , x’’ ) = | x \setminus x’’ | = | x \setminus x’ | \le \mathrm{dist}(x , x’ ).\) Combining inequalities gives \(\ell(x,y) \le \ell(x’ , y) + \mathrm{dist}(x , x’ )\). The other direction is symmetric. ∎

The Shifted Inverse Sensitivity Mechanism

Let’s recap where we are: We have a monotone function \(f : \mathcal{X}^* \to \mathcal{Y}\), where \(\mathcal{Y} \subseteq \mathbb{R}\) is finite. We want to approximate \(f(x)\) privately. Equation 4 gives us a loss \(\ell\) that is low-sensitivity. We have \(\ell(x,f(x))=0\) and, if \(y < f(x)\) decreases, the loss \(\ell(x,y)\) increases (depending on the down sensitivity of \(f\)). So far, so good. The problem is that if \(y > f(x)\) increases, the loss \(\ell(x,y)\) doesn’t increase. This means we can’t just throw this loss into the exponential mechanism.

Intuitively, the way we get around this problem is by looking for a value \(y\) such that the loss \(\ell(x,y)\) is greater than zero, but not too large. That is, we “shift” our goal from trying to minimize \(\ell(x,y)\) to minimizing something like \(|\ell(x,y)-\tau|\) for some integer \(\tau>0\). Going back to the example of the maximum, this corresponds to aiming for the \((\tau+1)\)-th largest value instead of the largest value. The hope is that we get an output with \(|\ell(x,y)-\tau|<\tau\), which for the maximum example corresponds roughly to getting a value between the largest value and the \(2\tau\)-th largest value.

Fang, Dong, and Yi [FDY22] directly apply the exponential mechanism [MT07] with a loss of the form \(|\ell(x,y)-\tau|\).⁴ This yields the following guarantee.

Theorem 4. (Shifted Inverse Sensitivity Mechanism) Let \(f : \mathcal{X}^* \to \mathcal{Y}\) be monotone (Equation 3), where \(\mathcal{Y} \subseteq \mathbb{R}\) is finite. Let \(\varepsilon>0\) and \(\beta \in (0,1)\). Then there exists an \(\varepsilon\)-differentially private \(M : \mathcal{X}^* \to \mathcal{Y}\) with the following accuracy guarantee. For all \(x \in \mathcal{X}^*\), we have \[\mathbb{P}\left[ f(x) \ge M(x) \ge f(x) - \mathsf{DS}_f^{2\tau}(x) \right] \ge 1 - \beta,\] where \(\tau=\left\lceil\frac{2}{\varepsilon}\log\left(\frac{|\mathcal{Y}|}{\beta}\right)\right\rceil\).

This is exactly the kind of guarantee we were aiming for; the accuracy scales with the down sensitivity, which could be much smaller than either the local sensitivity or the global sensitivity. Note that the guarantee gives an underestimate: \(M(x) \le f(x)\). This is inherent. If the function has infinite “up sensitivity,” then we cannot give an upper bound in a differentially private manner.

The shifted inverse sensitivity mechanism has the same limitations as the inverse sensitivity mechanism that we discussed in our previous post. Namely, computing the loss can be computationally intractable for general functions and we have a \(\log|\mathcal{Y}|\) dependence. (We will discuss how to improve this next.) An additional limitation is that we need the monotonicity assumption. But, as discussed earlier, down sensitivity behaves weirdly without this assumption.
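For the maximum, the mechanism behind Theorem 4 is short enough to write out. The sketch below uses the loss \(\ell^*(x,y) = \max\{\ell(x,y)-\tau,\tau-\overline{\ell}(x,y)\}\) from the footnote, where for the maximum \(\ell(x,y)\) counts entries strictly above \(y\) and \(\overline{\ell}(x,y)\) counts entries at or above \(y\) (assuming the maximum of the empty dataset is \(-\infty\)). It then samples from the exponential mechanism. Function names, the optional rng argument, and the example data are assumptions made for illustration.

```python
import math
import random

def shifted_loss(x, y, tau):
    """Shifted loss for f = max: max(loss - tau, tau - strict_loss)."""
    loss = sum(1 for xi in x if xi > y)           # removals needed so that max <= y
    strict_loss = sum(1 for xi in x if xi >= y)   # removals needed so that max < y
    return max(loss - tau, tau - strict_loss)

def shifted_inverse_sensitivity_max(x, outputs, eps, beta, rng=random):
    """Exponential mechanism over `outputs` with the shifted loss."""
    tau = math.ceil(2.0 / eps * math.log(len(outputs) / beta))
    scores = [shifted_loss(x, y, tau) for y in outputs]
    best = min(scores)  # subtracting the minimum avoids underflow in exp
    weights = [math.exp(-0.5 * eps * (s - best)) for s in scores]
    return rng.choices(outputs, weights=weights, k=1)[0]

x = list(range(1000))          # hypothetical data with true maximum 999
outputs = list(range(1000))    # candidate outputs
estimate = shifted_inverse_sensitivity_max(x, outputs, eps=1.0, beta=0.05,
                                           rng=random.Random(0))
print(estimate, max(x))
```

Evaluating the loss is cheap here only because the maximum has a closed-form loss; for a general monotone function, computing \(\ell\) may require a search over subsets.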

Beyond the Exponential Mechanism

Applying the exponential mechanism to find \(y\) with \(\ell(x,y)\approx\tau\) yields a clean guarantee in Theorem 4. However, there are other methods we can apply which may be simpler⁴ and give better asymptotic guarantees.

Observe that the loss \(\ell(x,y)\) is a decreasing function of \(y\). The exponential mechanism does not exploit this structure. A very natural alternative algorithm is to perform binary search.⁵

We describe the algorithm in pseudocode and briefly analyze it: The input is the loss \(\ell\) defined in Equation 4, the dataset \(x\), an ordered enumeration of the set of outputs \(\mathcal{Y} = \{y_0 \le y_1 \le \cdots \le y_{|\mathcal{Y}|-1} \}\), and parameters \(\sigma,\tau>0\).

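import numpy as np

rng = np.random.default_rng()

def noisy_binary_search(loss, x, Y, sigma, tau):
    # Y must be sorted in increasing order, so loss(x, Y[i]) is non-increasing in i.
    i_min = 0
    i_max = len(Y) - 1
    while i_min + 1 < i_max:
        k = (i_min + i_max) // 2
        # Noisy loss at the midpoint, using Laplace noise with scale sigma.
        v = loss(x, Y[k]) + rng.laplace(scale=sigma)
        if v <= tau:
            i_max = k  # loss looks small: keep searching at or below index k
        else:
            i_min = k  # loss looks large: keep searching above index k
    return Y[i_max]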

Since each iteration satisfies \(\frac1\sigma\)-differential privacy and there are at most \(\lceil \log_2 |\mathcal{Y}| \rceil-1\) iterations, the algorithm satisfies \(\varepsilon\)-differential privacy for \(\varepsilon = \frac{\log_2 |\mathcal{Y}|}{\sigma} \) by basic composition. Alternatively, using advanced composition, we see that the algorithm satisfies \(\rho\)-zCDP [BS16] for \(\rho = \frac{\log_2 |\mathcal{Y}|}{2\sigma^2} \).

By a union bound, all of the noise samples have magnitude at most \(\tau\) simultaneously with probability at least \(1 - \exp(-\tau/\sigma) \cdot \log_2|\mathcal{Y}|\).⁶ Assuming the noise magnitudes are \(\le\tau\), the binary search maintains the invariants \(\ell(x,y_{i_\min})>0\) and \(\ell(x,y_{i_\max})\le 2\tau\). These invariants imply \(y_{i_\min} < f(x)\) and \(y_{i_\max} \ge f(x) - \mathsf{DS}_f^{2\tau}(x)\) respectively. At the end of the binary search, \(i_\min+1 \ge i_\max\) and thus \(y_{i_\min} < f(x)\) implies \(y_{i_\max} \le f(x)\).

Setting \(\tau = \sigma \cdot \log\left(\frac{\log_2|\mathcal{Y}|}{\beta}\right)\) and \(\sigma = \frac{\log_2|\mathcal{Y}|}{\varepsilon}\) yields a result similar to Theorem 4.
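As a sanity check, the binary search can be run end to end on the maximum with these parameter settings. The sketch below is self-contained: it restates the search with an explicit random generator, draws Laplace noise as the difference of two exponential samples, and uses the closed-form loss for the maximum (the number of entries above \(y\)). The data, the output grid, and the helper names are made up for illustration.

```python
import math
import random

def max_loss(x, y):
    """Equation 4 for f = max: number of entries strictly above y."""
    return sum(1 for xi in x if xi > y)

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_binary_search(loss, x, Y, sigma, tau, rng):
    i_min, i_max = 0, len(Y) - 1
    while i_min + 1 < i_max:
        k = (i_min + i_max) // 2
        if loss(x, Y[k]) + laplace(sigma, rng) <= tau:
            i_max = k
        else:
            i_min = k
    return Y[i_max]

rng = random.Random(0)
x = [rng.randint(0, 500) for _ in range(10_000)]   # hypothetical dataset
Y = list(range(1024))                              # ordered candidate outputs
eps, beta = 1.0, 0.05
sigma = math.log2(len(Y)) / eps                    # pure DP calibration
tau = sigma * math.log(math.log2(len(Y)) / beta)
estimate = noisy_binary_search(max_loss, x, Y, sigma, tau, rng)
print("estimate:", estimate, "true maximum:", max(x))
```

Because the search only ever evaluates the loss at about \(\log_2|\mathcal{Y}|\) candidates, it is also much cheaper to run than an exponential mechanism that scores every element of \(\mathcal{Y}\).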

Setting \(\tau = \sigma \cdot \log\left(\frac{\log_2|\mathcal{Y}|}{\beta}\right)\) and \(\sigma = \sqrt{\frac{\log_2|\mathcal{Y}|}{2\rho}}\) yields the following result for concentrated differential privacy [DR16,BS16]. Note that setting \(\rho = \frac{\varepsilon^2}{4\log(1/\delta)+4\varepsilon}\) suffices to give \((\varepsilon,\delta)\)-differential privacy [e.g. S22 Remark 15].

Theorem 5. (Shifted Inverse Sensitivity Mechanism with Concentrated Differential Privacy) Let \(f : \mathcal{X}^* \to \mathcal{Y}\) be monotone (Equation 3), where \(\mathcal{Y} \subseteq \mathbb{R}\) is finite. Let \(\rho>0\) and \(\beta \in (0,1)\). Then there exists a \(\rho\)-zCDP \(M : \mathcal{X}^* \to \mathcal{Y}\) with the following accuracy guarantee. For all \(x \in \mathcal{X}^*\), we have \[\mathbb{P}\left[ f(x) \ge M(x) \ge f(x) - \mathsf{DS}_f^{2\tau}(x) \right] \ge 1 - \beta,\] where \(\tau = \sqrt{\frac{\log_2|\mathcal{Y}|}{2\rho}} \cdot \log\left(\frac{\log_2|\mathcal{Y}|}{\beta}\right) \).

Comparing Theorems 4 and 5 we see an asymptotic improvement in the dependence on the size of the output space \(|\mathcal{Y}|\). (This improvement is the benefit of advanced composition.) Theorem 4 gives \(\tau = \Theta(\log|\mathcal{Y}|)\), while Theorem 5 gives \(\tau = \Theta(\sqrt{\log|\mathcal{Y}|} \cdot \log \log |\mathcal{Y}|)\).⁷ In exchange, Theorem 4 gives a pure differential privacy guarantee (i.e. \((\varepsilon,\delta)\)-DP with \(\delta=0\)), while Theorem 5 gives a concentrated differential privacy guarantee, which can be translated to approximate differential privacy (i.e. \((\varepsilon,\delta)\)-DP with \(\delta>0\)).
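The asymptotic statement hides constants, so it can help to plug in numbers. The sketch below evaluates the two expressions for \(\tau\) as stated in Theorems 4 and 5, choosing \(\rho = \frac{\varepsilon^2}{4\log(1/\delta)+4\varepsilon}\) so that the zCDP guarantee implies \((\varepsilon,\delta)\)-DP. The output space is parameterized by \(\log_2|\mathcal{Y}|\) to avoid overflow; the function names and parameter values are illustrative.

```python
import math

def tau_theorem4(log2_size, eps, beta):
    """tau from Theorem 4, with |Y| = 2 ** log2_size."""
    log_size = log2_size * math.log(2.0)
    return math.ceil(2.0 / eps * (log_size + math.log(1.0 / beta)))

def tau_theorem5(log2_size, rho, beta):
    """tau from Theorem 5, with |Y| = 2 ** log2_size."""
    return math.sqrt(log2_size / (2.0 * rho)) * math.log(log2_size / beta)

def rho_for_approx_dp(eps, delta):
    """A rho for which rho-zCDP suffices for (eps, delta)-DP."""
    return eps ** 2 / (4.0 * math.log(1.0 / delta) + 4.0 * eps)

eps, delta, beta = 1.0, 1e-6, 0.05
rho = rho_for_approx_dp(eps, delta)
for log2_size in (16, 1024, 65536):
    t4 = tau_theorem4(log2_size, eps, beta)
    t5 = tau_theorem5(log2_size, rho, beta)
    print(f"log2|Y| = {log2_size}: Theorem 4 tau = {t4}, Theorem 5 tau = {t5:.1f}")
```

At these particular settings the Theorem 5 value only drops below the Theorem 4 value for very large output spaces, because the small \(\rho\) needed for a small \(\delta\) inflates the constant. The comparison shifts with \(\varepsilon\), \(\delta\), and \(\beta\).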

We can actually do even better than binary search! The problem we’re solving with binary search is actually an instance of the generalized interior point problem [BDRS18] (which is essentially the same as quasi-concave optimization [CLNSS23]). This problem and its variants have been extensively studied in the context of private learning [BNS13,BNSV15,etc.]. The upshot is that, under \((\varepsilon,\delta)\)-differential privacy, we can achieve the same result as Theorems 4 and 5 with \(\tau = \frac{\log(1/\delta)}{\varepsilon} \cdot 2^{O(\log^* |\mathcal{Y}|)}\), where \(\log^*\) denotes the iterated logarithm.

Theorem 6. (Shifted Inverse Sensitivity Mechanism with Approximate Differential Privacy) Let \(f : \mathcal{X}^* \to \mathcal{Y}\) be monotone (Equation 3), where \(\mathcal{Y} \subseteq \mathbb{R}\) is finite. Let \(\varepsilon>0\) and \(\delta \in (0,.1)\). Then there exists an \((\varepsilon,\delta)\)-differentially private \(M : \mathcal{X}^* \to \mathcal{Y}\) with the following accuracy guarantee. For all \(x \in \mathcal{X}^*\), we have \[\mathbb{P}\left[ f(x) \ge M(x) \ge f(x) - \mathsf{DS}_f^{2\tau}(x) \right] \ge \frac{9}{10},\] where \(\tau = \frac{\log(1/\delta)}{\varepsilon} \cdot 2^{O(\log^* |\mathcal{Y}|)}\).

The iterated logarithm is an unbelievably slow-growing function. Thus Theorem 6 improves on Theorems 4 and 5 in terms of the dependence on \(|\mathcal{Y}|\). However, the dependence on \(\delta\) is worse than Theorem 5 (\(\tau=\Theta(\log(1/\delta))\) versus \(\tau=\Theta(\sqrt{\log(1/\delta)})\)). (Theorem 4 achieves \(\delta=0\).)

Conclusion

In this post we’ve covered the shifted inverse sensitivity mechanism of Fang, Dong, and Yi [FDY22], as well as some extensions.

The key takeaway is that we can privately approximate a monotone function with error scaling with the down sensitivity. This is particularly interesting in settings where the local and global sensitivities are large. Down sensitivity is an appealing notion because it is entirely defined by the “real” dataset; its definition (Equation 1) does not consider hypothetical data items that aren’t in the dataset.

Fang, Dong, and Yi [FDY22] show that the shifted inverse sensitivity mechanism attains strong instance optimality guarantees. In other words, up to logarithmic factors, no differentially private mechanism can achieve better error guarantees.

We can view the shifted inverse sensitivity mechanism as a reduction. It reduces the task of approximating a monotone function to a problem akin to approximating the median. (More precisely, it reduces it to a generalized interior point problem.) We think this is a neat addition to the toolkit of differentially private algorithms.

We emphasize that user-level differential privacy is not an alternative privacy definition, rather it is the standard definition of differential privacy with a data schema allowing multiple data items per person. In contrast, most of the differential privacy literature assumes a one-to-one correspondence between people and data items. Note that we prefer the terminology “person”/“people” rather than “user”/“users.” The “user” terminology is specific to the tech industry and may be confusing in other contexts; e.g., in the context of the US Census Bureau, “users” are the entities (such as government agencies) that use data provided by the bureau, rather than the people whose data the bureau collects. ↩
The name “down sensitivity” is due to Cummings and Durfee [CD20], who attribute the idea to Raskhodnikova and Smith [RS16]. The name local empirical sensitivity has also been used [CZ13]. The \(k\)-down sensitivity should not be confused with the down sensitivity at distance \(k\), which is defined by \(\mathsf{DS}_f^{(k)}(x) := \sup \{ \mathsf{DS}_f^1(x’) : \mathrm{dist}(x,x’) \le k \}\). Note that \(\mathsf{DS}_f^k(x) \le k \cdot \mathsf{DS}_f^{(k-1)}(x)\). ↩
The finiteness assumption can be relaxed somewhat, but we do need some kind of constraint on the output space to ensure utility. The surjectivity assumption simply ensures that the loss is always finite; alternatively we could allow the loss to take the value infinity. Note that we define \(\mathcal{X}^* := \bigcup_{n=0}^\infty \mathcal{X}^n\) to be the set of all finite tuples of elements in \(\mathcal{X}\); we use subset notation \(x’ \subseteq x \) to denote that \(x’\) can be obtained by removing elements from \(x\) (and potentially permuting). ↩
Alas, there is a technical issue we need to deal with in order to apply the exponential mechanism: The loss function is far from continuous, so there may not exist any \(y\) such that \(|\ell(x,y)-\tau|<\tau\). For example, computing the maximum of the dataset \(x=(1,1,\cdots,1)\) gives a loss function with \(\ell(x,y)=0\) for all \(y \ge 1\) and \(\ell(x,y)=n\) for all \(y < 1\); i.e., no \(y\) gives \(0<\ell(x,y)<n\). The way we fix this issue is as follows. Observe that we can decompose \(|\ell(x,y)-\tau|=\max\{\ell(x,y)-\tau,\tau-\ell(x,y)\}\). Now we define a slightly different loss function: \[\overline{\ell}(x,y) := \min\{ \mathrm{dist}(x,\tilde{x}) : \tilde{x} \subseteq x, f(\tilde{x}) < y \}. \tag{A}\] Equation A defining \(\overline{\ell}(x,y)\) differs from Equation 4 defining \(\ell(x,y)\) only in that we replace “\(\le\)” with “\(<\)”. The modified loss \(\overline\ell\) still has low sensitivity; the proof is identical to that of Proposition 3. Now we can run the exponential mechanism with the loss \[\ell^*(x,y) := \max\{\ell(x,y)-\tau,\tau-\overline{\ell}(x,y)\}. \tag{B}\] This loss has low sensitivity and, for \(\hat{y} = \min\{f(\tilde{x}):\tilde{x}\subseteq x, \mathrm{dist}(x,\tilde{x})\le\tau\}\), we have \(\ell(x,\hat{y})\le\tau\) and \(\overline{\ell}(x,\hat{y})>\tau\), which implies \(\ell^*(x,\hat{y}) \le 0\). Thus we can use \(\ell^*(x,y)\) in place of \(|\ell(x,y)-\tau|\) to fix this technical issue. Setting \(\tau=\left\lceil\frac{2}{\varepsilon}\log\left(\frac{|\mathcal{Y}|}{\beta}\right)\right\rceil\) and running the exponential mechanism with loss \(\ell^*\) yields Theorem 4. Specifically, the guarantee of the exponential mechanism is \(\mathbb{P}\left[ \ell^*(x,M(x)) < \frac{2}{\varepsilon}\log\left(\frac{|\mathcal{Y}|}{\beta}\right)\right]\ge 1-\beta\). Then \(\tau-\overline{\ell}(x,M(x))< \frac{2}{\varepsilon}\log\left(\frac{|\mathcal{Y}|}{\beta}\right)\) implies \(\overline{\ell}(x,M(x))>0\), which implies \(M(x)\le f(x)\).
Similarly, \(\ell(x,M(x))-\tau < \frac{2}{\varepsilon}\log\left(\frac{|\mathcal{Y}|}{\beta}\right)\) implies \(\ell(x,M(x))<2\tau\), which implies that \(M(x) \ge f(\tilde{x})\) for some \(\tilde{x}\subseteq x\) with \(\mathrm{dist}(x,\tilde{x})<2\tau\); by the definition of down sensitivity, \(|f(x)-f(\tilde{x})| \le \mathsf{DS}_f^{2\tau}(x)\) and so \(M(x) \ge f(\tilde{x}) \ge f(x) - \mathsf{DS}_f^{2\tau}(x)\), as required. ↩ ↩²
To the best of our knowledge, differentially private binary search was first proposed by Blum, Ligett, and Roth [BLR08]. This algorithmic idea has been used in various other papers [e.g., BSU17,FS17,DGMSS21]. ↩
Note that we can also use Gaussian noise instead of Laplace noise. This would yield a slightly better accuracy guarantee for the same concentrated differential privacy guarantee. Specifically, this would give \(\tau = O\left(\sqrt{\frac1\rho \cdot \log |\mathcal{Y}| \cdot \log \left( \frac{\log | \mathcal{Y} |}{\beta}\right)}\right)\). ↩
We can shave the loglog term in Theorem 5 to get \(\tau = \Theta(\sqrt{\log|\mathcal{Y}|})\) either by using a noise-tolerant version of binary search [KK07] or by using non-independent noise [SU15,GZ20,GKM21,DK22]. ↩

Beyond Global Sensitivity via Inverse Sensitivity

Tue, 05 Sep 2023 09:00:00 -0700

The most well-known and widely-used method for achieving differential privacy is to compute the true function value \(f(x)\) and then add Laplace or Gaussian noise scaled to the global sensitivity of \(f\). This may be overly conservative. In this post we’ll show how we can do better.

The global sensitivity of a function \(f : \mathcal{X}^* \to \mathbb{R}\) is defined by \[ \mathsf{GS}_f := \sup_{x,x’\in\mathcal{X}^* : \mathrm{dist}(x,x’) \le 1} |f(x)-f(x’)|, \tag{1}\] where \(\mathrm{dist}(x,x’)\le 1\) denotes that \(x\) and \(x’\) are neighbouring datasets (i.e. they differ only by the addition, removal, or replacement of one person’s data); more generally, \(\mathrm{dist}(\cdot,\cdot)\) is the corresponding metric on datasets (i.e., Hamming distance).¹

The global sensitivity considers datasets that have nothing to do with the dataset at hand and which could be completely unrealistic. Many functions have infinite global sensitivity, but, on reasonably nice datasets, their local sensitivity is much lower.

Local Sensitivity

The \(k\)-local sensitivity² of a function \(f : \mathcal{X}^* \to \mathbb{R}\) at \(x \in \mathcal{X}^*\) is defined by \[\mathsf{LS}^k_f(x) := \sup_{x’\in\mathcal{X}^* : \mathrm{dist}(x,x’) \le k} |f(x)-f(x’)|. \tag{2}\] Often, we fix \(k=1\) and we may drop the superscript: \(\mathsf{LS}_f(x) := \mathsf{LS}_f^1(x)\). Note that the local sensitivity is always at most the global sensitivity: \(\mathsf{LS}_f^k(x) \le k \cdot \mathsf{GS}_f\).

As a concrete example, the median has infinite global sensitivity, but for realistic data the local sensitivity is quite reasonable. Specifically, \[\mathsf{LS}^k_{\mathrm{median}}(x_1, \cdots, x_n) = \max\left\{ \left|x_{(\tfrac{n+1}{2})}-x_{(\tfrac{n+1}{2}+k)}\right|, \left|x_{(\tfrac{n+1}{2})}-x_{(\tfrac{n+1}{2}-k)}\right| \right\},\tag{3}\] where \( x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}\) denotes the input in sorted order and \(n\) is assumed to be odd, so, in particular, \(\mathrm{median}(x_1, \cdots, x_n) = x_{(\tfrac{n+1}{2})}\). For example, if \(X_1, \cdots, X_n\) are i.i.d. samples from a standard Gaussian and \(k \ll n\), then \(\mathsf{LS}^k_{\mathrm{median}}(X_1, \cdots, X_n) \le O(k/n)\) with high probability.
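Equation 3 translates directly into code. The sketch below assumes \(n\) is odd, as in the equation, and returns infinity when \(\tfrac{n+1}{2} \pm k\) falls outside \(\{1,\cdots,n\}\), on the assumption that the data domain is unbounded so that enough changes can move the median arbitrarily far. The function name is illustrative.

```python
import math

def median_local_sensitivity(x, k=1):
    """k-local sensitivity of the median (Equation 3), for an odd number of entries."""
    n = len(x)
    if n % 2 == 0:
        raise ValueError("Equation 3 assumes an odd number of entries")
    s = sorted(x)
    m = (n + 1) // 2                 # 1-based rank of the median
    if m - k < 1 or m + k > n:
        # Assumption: with an unbounded domain, this many changes can move
        # the median arbitrarily far.
        return math.inf
    med = s[m - 1]
    return max(abs(med - s[m + k - 1]), abs(med - s[m - k - 1]))

print(median_local_sensitivity((1, 2, 2)))          # neighbouring datasets can
print(median_local_sensitivity((2, 2, 2)))          # have different local sensitivity
print(median_local_sensitivity((1, 3, 4, 6, 9), k=2))
```

Note that the local sensitivity depends on the data itself, which is why it cannot be used as a noise scale without further care.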

Using Local Sensitivity

Intuitively, the local sensitivity is the “real” sensitivity of the function and the global sensitivity is only a worst-case upper bound. Thus it seems natural to add noise scaled to the local sensitivity instead of the global sensitivity.

Unfortunately, naïvely adding noise scaled to local sensitivity doesn’t satisfy differential privacy. The problem is that the local sensitivity itself can reveal information. For example, consider the median on the inputs \(x=(1,2,2),x’=(2,2,2)\). The output distributions of the algorithm on these two inputs must be similar. In both cases the median is \(2\), so that is a good start for ensuring that the distributions are similar. But the local sensitivity is different: \(\mathsf{LS}^1_{\mathrm{median}}(x)=1\) versus \(\mathsf{LS}^1_{\mathrm{median}}(x’)=0\). So, if we add noise scaled to local sensitivity, then, on input \(x’\), we deterministically output \(2\), while, on input \(x\), we output a random number. If we use continuous Laplace or Gaussian noise, then the random number will be a non-integer almost surely. Thus the output perfectly distinguishes the two inputs, which is a catastrophic violation of differential privacy.

The good news is that we can exploit local sensitivity; we just need to do a bit more work. In fact, there are many methods in the differential privacy literature to exploit local sensitivity.

The best-known methods for exploiting local sensitivity are smooth sensitivity [NRS07]³ and propose-test-release [DL09]⁴.

In this post we will cover a different general-purpose technique. This technique is folklore.⁵ It was first systematically studied by Asi and Duchi [AD20,AD20], who also named the method the inverse sensitivity mechanism.

The Inverse Sensitivity Mechanism

Consider a function \(f : \mathcal{X}^* \to \mathcal{Y}\). Our goal is to estimate \(f(x)\) in a differentially private manner. But we do not make any assumptions about the global sensitivity of the function.

For simplicity we will assume that \(\mathcal{Y}\) is finite and that \(f\) is surjective.⁶

Now we define a loss function \(\ell : \mathcal{X}^* \times \mathcal{Y} \to \mathbb{Z}_{\ge0}\) by \[\ell(x,y) := \min\left\{ \mathrm{dist}(x,\tilde{x}) : \tilde{x}\in\mathcal{X}^*, f(\tilde{x})=y \right\}.\tag{4}\] In other words, \(\ell(x,y)\) measures how many entries of \(x\) we need to add or remove until \(f(x)=y\). Yet another way to think of it is that \(\ell(x,y)\) is the distance from the point \(x\) to the set \(f^{-1}(y)\). (Hence the name inverse sensitivity.)

The loss is minimized by the desired answer: \(\ell(x,f(x))=0\). Intuitively, the loss \(\ell(x,y)\) increases as \(y\) moves further from \(f(x)\). So approximately minimizing this loss should produce a good approximation to \(f(x)\), as desired.

The trick is that this loss always has bounded global sensitivity – i.e., \(\mathsf{GS}_\ell \le 1\) – no matter what the sensitivity of \(f\) is!

Lemma 1. Let \(f : \mathcal{X}^* \to \mathcal{Y}\) be arbitrary and define \(\ell : \mathcal{X}^* \times \mathcal{Y} \to \mathbb{Z}_{\ge0}\) as in Equation 4. Then, for all \(x,x’\in\mathcal{X}^*\) with \(\mathrm{dist}(x,x’)\le 1\) and all \(y \in \mathcal{Y}\), we have \(|\ell(x,y)-\ell(x’,y)|\le 1\).

Proof. Fix \(x,x’\in\mathcal{X}^*\) with \(\mathrm{dist}(x,x’)\le 1\) and \(y \in \mathcal{Y}\). Let \(\widehat{x} \in\mathcal{X}^*\) satisfy \(\ell(x,y)=\mathrm{dist}(x,\widehat{x})\) and \(f(\widehat{x})=y\). By definition, \[\ell(x’,y) = \min\left\{ \mathrm{dist}(x’,\tilde{x}) : f(\tilde{x})=y \right\} \le \mathrm{dist}(x’,\widehat{x}).\] By the triangle inequality, \[\mathrm{dist}(x’,\widehat{x}) \le \mathrm{dist}(x’,x)+\mathrm{dist}(x,\widehat{x}) \le 1 + \ell(x,y).\] Thus \(\ell(x’,y) \le \ell(x,y)+1\) and, by symmetry, \(\ell(x,y) \le \ell(x’,y)+1\), as required. ∎

This means that we can run the exponential mechanism [MT07] to select from \(\mathcal{Y}\) using the loss \(\ell\).⁷ That is, the inverse sensitivity mechanism is defined by \[\forall y \in \mathcal{Y} ~~~~~ \mathbb{P}[M(x)=y] := \frac{\exp\left(-\frac{\varepsilon}{2}\ell(x,y)\right)}{\sum_{y’\in\mathcal{Y}}\exp\left(-\frac{\varepsilon}{2}\ell(x,y’)\right)}.\tag{5}\] By the properties of the exponential mechanism and Lemma 1, \(M\) satisfies differential privacy:

Theorem 2. (Privacy of the Inverse Sensitivity Mechanism) Let \(M : \mathcal{X}^* \to \mathcal{Y}\) be as defined in Equation 5 with the loss from Equation 4. Then \(M\) satisfies \(\varepsilon\)-differential privacy (and \(\frac18\varepsilon^2\)-zCDP).
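As an illustration of Equations 4 and 5, here is a minimal sketch of the inverse sensitivity mechanism for the median. It rests on assumptions that are not in the text above: the dataset has odd size, neighbouring datasets differ by replacing one entry, and the output space is a small finite list `outputs` that also serves as the data domain. Under these assumptions the loss has a closed form: the number of entries that must be replaced so that at most \((n-1)/2\) entries lie strictly below \(y\) and at most \((n-1)/2\) lie strictly above it. The function names are hypothetical.

```python
import math
import random


def median_loss(x, y):
    """Loss from Equation 4 for f = median, assuming odd n and replacement
    neighbours: the fewest entries of x to replace so the median becomes y."""
    n = len(x)
    assert n % 2 == 1, "this sketch assumes an odd number of entries"
    m = (n + 1) // 2
    below = sum(1 for v in x if v < y)
    above = sum(1 for v in x if v > y)
    return max(0, below - (m - 1), above - (m - 1))


def inverse_sensitivity_probs(x, outputs, epsilon):
    """Output distribution from Equation 5 over the finite list `outputs`."""
    weights = [math.exp(-0.5 * epsilon * median_loss(x, y)) for y in outputs]
    total = sum(weights)
    return [w / total for w in weights]


def inverse_sensitivity_median(x, outputs, epsilon, rng=random):
    """Draw one output of the mechanism."""
    probs = inverse_sensitivity_probs(x, outputs, epsilon)
    return rng.choices(outputs, weights=probs, k=1)[0]


outputs = [0, 1, 2, 3, 4]  # hypothetical finite output space
print(inverse_sensitivity_probs([1, 2, 2], outputs, epsilon=1.0))
print(inverse_sensitivity_probs([2, 2, 2], outputs, epsilon=1.0))
```

Comparing the two printed distributions entry by entry gives a quick sanity check of Theorem 2 on the neighbouring inputs from the median example: no probability should change by more than a factor of \(e^\varepsilon\).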

Utility Guarantee

The privacy guarantee of the inverse sensitivity mechanism is easy and, in particular, it doesn’t depend on the properties of \(f\). This means that the utility will need to depend on the properties of \(f\).

By the standard properties of the exponential mechanism, we can guarantee that the output has low loss:

Lemma 3. Let \(M : \mathcal{X}^* \to \mathcal{Y}\) be as defined in Equation 5 with the loss from Equation 4. For all inputs \(x \in \mathcal{X}^*\) and all \(\beta\in(0,1)\), we have \[\mathbb{P}\left[\ell(x,M(x)) < \frac2\varepsilon\log\left(\frac{|\mathcal{Y}|}{\beta}\right) \right] \ge 1-\beta.\tag{6}\]

Proof. Let \(B_x = \left\{ y \in \mathcal{Y} : \ell(x,y) \ge \frac2\varepsilon\log\left(\frac{|\mathcal{Y}|}{\beta}\right) \right\}\) be the subset of \(\mathcal{Y}\) with high loss. Then \[ \mathbb{P}[M(x)\in B_x] = \frac{\sum_{y \in B_x} \exp\left(-\frac{\varepsilon}{2}\ell(x,y)\right)}{\sum_{y’\in\mathcal{Y}}\exp\left(-\frac{\varepsilon}{2}\ell(x,y’)\right)} \]\[ \le \frac{|B_x| \cdot \exp\left(-\frac{\varepsilon}{2}\frac2\varepsilon\log\left(\frac{|\mathcal{Y}|}{\beta}\right) \right)}{\exp\left(-\frac{\varepsilon}{2}\ell(x,f(x))\right)}\]\[= \frac{|B_x| \cdot \frac{\beta}{|\mathcal{Y}|}}{1} \le \beta, \] as required. ∎

Now we need to translate this loss bound into something easier to interpret – local sensitivity.

Suppose \(y \gets M(x)\). Then we have some loss \(k=\ell(x,y)\). What this means is that there exists \(\tilde{x}\in\mathcal{X}^*\) with \(f(\tilde{x})=y\) and \(\mathrm{dist}(x,\tilde{x})\le k\). By the definition of local sensitivity, \(|f(x)-y| = |f(x)-f(\tilde{x})| \le \mathsf{LS}_f^k(x)\). This means we can translate the loss guarantee of Lemma 3 into an accuracy guarantee in terms of local sensitivity:

Theorem 4. (Utility of the Inverse Sensitivity Mechanism) Let \(M : \mathcal{X}^* \to \mathcal{Y}\) be as defined in Equation 5 with the loss from Equation 4. For all inputs \(x \in \mathcal{X}^*\) and all \(\beta\in(0,1)\), we have \[\mathbb{P}\left[\left|M(x)-f(x)\right| \le \mathsf{LS}_f^k(x) \right] \ge 1-\beta,\tag{7}\] where \(k=\left\lfloor\frac2\varepsilon\log\left(\frac{|\mathcal{Y}|}{\beta}\right)\right\rfloor\).
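To get a feel for the numbers in Lemma 3 and Theorem 4, the sketch below computes the loss threshold \(\frac2\varepsilon\log(|\mathcal{Y}|/\beta)\) and evaluates the exact probability that the mechanism of Equation 5 lands at or above it. The loss vector used here is a made-up example (one output per loss value \(0,1,\dots,50\)), and the function names are hypothetical.

```python
import math


def loss_threshold(epsilon, num_outputs, beta):
    """The bound (2/epsilon) * log(|Y| / beta) from Lemma 3."""
    return (2.0 / epsilon) * math.log(num_outputs / beta)


def high_loss_probability(losses, epsilon, beta):
    """Exact probability that the exponential mechanism with weights
    exp(-epsilon/2 * loss) selects an output whose loss is >= the threshold."""
    weights = [math.exp(-0.5 * epsilon * loss) for loss in losses]
    total = sum(weights)
    threshold = loss_threshold(epsilon, len(losses), beta)
    bad = sum(w for w, loss in zip(weights, losses) if loss >= threshold)
    return bad / total


losses = list(range(51))  # hypothetical: one output at each loss 0, 1, ..., 50
epsilon, beta = 1.0, 0.05
k = math.floor(loss_threshold(epsilon, len(losses), beta))
print("k =", k)
print("P[loss >= threshold] =", high_loss_probability(losses, epsilon, beta))
```

In this toy example the exact failure probability is far below \(\beta\), which reflects the fact that the proof of Lemma 3 bounds every high-loss output by the worst case.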

We can tie this back to our concrete example of the median. Per Equation 3, \[\mathsf{LS}^k_{\mathrm{median}}(x_1, \cdots, x_n) \le \left|x_{(\tfrac{n+1}{2}+k)}-x_{(\tfrac{n+1}{2}-k)}\right| .\] Thus the error guarantee of Theorem 4 for the median would scale with the spread of the data. E.g., if \(k=\tfrac{n+1}{4}\), then \(\mathsf{LS}^k_{\mathrm{median}}(x_1, \cdots, x_n)\) is at most the interquartile range of the data.
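The bound above is easy to evaluate on data. The following sketch computes \(x_{(\frac{n+1}{2}+k)}-x_{(\frac{n+1}{2}-k)}\) for a dataset of odd size; the function name is hypothetical, and the sketch simply raises an error when the order statistics fall outside the dataset rather than guessing how the out-of-range case should be handled.

```python
def median_local_sensitivity_bound(x, k):
    """Upper bound on LS^k of the median: the gap between the order statistics
    k positions above and k positions below the median (odd n only)."""
    n = len(x)
    if n % 2 != 1:
        raise ValueError("this sketch assumes an odd number of entries")
    s = sorted(x)
    m = (n + 1) // 2  # 1-indexed position of the median
    if m - k < 1 or m + k > n:
        raise ValueError("k is too large for this dataset")
    return s[m + k - 1] - s[m - k - 1]  # convert to 0-indexed


data = [7, 1, 3, 5, 2, 6, 4]
print(median_local_sensitivity_bound(data, 2))  # k = (n+1)/4 for n = 7
```

With \(n=7\) and \(k=(n+1)/4=2\), the bound is the gap between the second-smallest and second-largest entries, matching the interquartile-range interpretation above.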

How does this compare with the usual global sensitivity approach? The \(\varepsilon\)-differentially private Laplace mechanism is given by \(\widehat{M}(x):=f(x)+\mathsf{Laplace}(\mathsf{GS}_f/\varepsilon)\). For all \(x \in \mathcal{X}^*\) and all \(\beta\in(0,1)\), we have the utility guarantee \[\mathbb{P}\left[\left|\widehat{M}(x)-f(x)\right| \le \mathsf{GS}_f \cdot \frac1\varepsilon \log\left(\frac{1}{\beta}\right) \right] \ge 1-\beta,\tag{8}\] since \(\mathbb{P}\left[|\mathsf{Laplace}(b)|>t\right]=e^{-t/b}\) for all \(t \ge 0\). Comparing Equations 7 and 8, we see that neither guarantee dominates the other. On one hand, the local sensitivity can be much smaller than the global sensitivity. On the other hand, we pick up a dependence on \(\log|\mathcal{Y}|\). In particular, in the worst case where the local sensitivity matches the global sensitivity, i.e., \(\mathsf{LS}_f^k(x)=k\cdot\mathsf{GS}_f\) with \(k=\frac2\varepsilon\log\left(\frac{|\mathcal{Y}|}{\beta}\right)\) (ignoring rounding), the inverse sensitivity mechanism is worse by a factor of \[\frac{\mathsf{LS}_f^k(x)}{\mathsf{GS}_f \cdot \frac1\varepsilon \log\left(\frac{1}{\beta}\right)} = 2 \frac{\log|\mathcal{Y}|}{\log(1/\beta)}+2.\tag{9}\] Hence the inverse sensitivity mechanism is most useful in situations where the local sensitivity is significantly smaller than the global sensitivity.

Conclusion

In this post we’ve covered the inverse sensitivity mechanism. We showed that it is private regardless of the sensitivity of the function \(f\), and that it gives error guarantees that scale with the local sensitivity of \(f\) rather than its global sensitivity.

The inverse sensitivity mechanism is a simple demonstration that there is more to differential privacy than simply adding noise scaled to global sensitivity; there are many more techniques in the literature.

The inverse sensitivity mechanism has two main limitations. First, it is, in general, not computationally efficient. Computing the loss function is intractable for an arbitrary \(f\) (but can be done efficiently for several examples like the median and variants of principal component analysis and linear regression [AD20]). Second, the \(\log|\mathcal{Y}|\) term in the accuracy guarantee is problematic when the output space is large, such as when we have high-dimensional outputs. While there are other techniques that can be used instead of inverse sensitivity, they suffer from some of the same limitations. Thus finding ways around these limitations is an active research topic [BKSW19,FDY22,HKMN23,DHK23,BHS23,AUZ23].

The inverse sensitivity mechanism’s accuracy can be shown to be instance-optimal up to logarithmic factors [AD20,AD20]. That is, up to logarithmic factors, no differentially private mechanism can achieve better error guarantees. Up to logarithmic factors, the inverse sensitivity mechanism outperforms other methods for exploiting local sensitivity, namely smooth sensitivity [NRS07]³ and propose-test-release [DL09]⁴.

We leave you with a riddle: What can we do if even the local sensitivity of our function is unbounded? For example, suppose we want to approximate \(f(x) = \max_i x_i\). Surprisingly, there are still things we can do; see our follow-up post.

¹ We define \(\mathcal{X}^* = \bigcup_{n = 0}^\infty \mathcal{X}^n\) to be the set of all input tuples of arbitrary size. The metric \(\mathrm{dist} : \mathcal{X}^* \times \mathcal{X}^* \to \mathbb{R}\) can be arbitrary. E.g. we can allow addition, removal, and/or replacement of an individual’s data. For simplicity, we consider univariate functions here. But the definitions of global and local sensitivity easily extend to vector-valued functions by taking a norm: \[ \mathsf{GS}_f := \sup_{x,x’\in\mathcal{X}^* : \mathrm{dist}(x,x’) \le 1} \|f(x)-f(x’)\|.\] If we use the 2-norm, then this cleanly corresponds to adding spherical Gaussian noise. The 1-norm corresponds to adding independent Laplace noise to the coordinates.
² The local sensitivity is also known as the local modulus of continuity [AD20,AD20]. Note that this should not be confused with the local sensitivity at distance \(k\) [NRS07], which is defined by \(\sup \{ \mathsf{LS}_f^1(x’) : \mathrm{dist}(x,x’) \le k \}\).
³ Briefly, smooth sensitivity is an upper bound on the local sensitivity which itself has low sensitivity in a multiplicative sense. That is, \(\mathsf{LS}_f^1(x) \le \mathsf{SS}_f^t(x)\) and \(\mathsf{SS}_f^t(x) \le e^t \cdot \mathsf{SS}_f^t(x’) \) for neighbouring \(x,x’\). This suffices to ensure that we can add noise scaled to \(\mathsf{SS}_f^t(x)\). However, that noise usually needs to be more heavy-tailed than for global sensitivity [BS19].
⁴ Roughly, the propose-test-release framework computes an upper bound on the local sensitivity in a differentially private manner and then uses this upper bound as the noise scale. (We hope to give more detail about both propose-test-release and smooth sensitivity in future posts.)
⁵ Properly attributing the inverse sensitivity mechanism is difficult. The earliest published instances of the inverse sensitivity mechanism of which we are aware are from 2011 and 2013 [MMNW11§3.1,JS13§5]; but this was not novel even then. Asi and Duchi [AD20§1.2] state that McSherry and Talwar [MT07] considered it in 2007. In any case, the name we use was coined in 2020 [AD20].
⁶ Assuming that the output space \(\mathcal{Y}\) is finite is a significant assumption. While it can be relaxed a bit [AD20], it is to some extent an unavoidable limitation [BNSV15,ALMM19]. For example, to apply the inverse sensitivity mechanism to the median, we must discretize and bound the inputs; bounding the inputs does impose a finite global sensitivity, but the dependence on the bound is logarithmic, so the bound can be fairly large. Assuming that the function is surjective is a minor assumption that ensures that the loss in Equation 4 is always well-defined; otherwise we can define the loss to be infinite for points that are not in the range of the function.
⁷ Note that we can use other selection algorithms, such as permute-and-flip [MS20] or report-noisy-max [DKSSWXZ21] or gap-max [CHS14,BDRS18,BKSW19].

Covariance-Aware Private Mean Estimation, Efficiently

Mon, 17 Jul 2023 12:00:00 -0400

Last week, the Mark Fulk award for best student paper at COLT 2023 was awarded to two papers on private mean estimation: [DHK23] and [BHS23].

The main result of both papers is the same: the first computationally-efficient \(O(d)\)-sample algorithm for differentially-private Gaussian mean estimation in Mahalanobis distance. In this post, we’re going to unpack the result and explain what this means.

Gaussian mean estimation is a classic statistical task: given \(X_1, \dots, X_n \in \mathbb{R}^d\) sampled i.i.d. from a \(d\)-dimensional Gaussian \(N(\mu, \Sigma)\), output a vector \(\hat \mu \in \mathbb{R}^d\) that approximates the true mean \(\mu \in \mathbb{R}^d\). But what do we mean by approximates? What distance measure should we use? A reasonable first guess is the \(\ell_2\)-norm: output an estimate \(\hat \mu\) that minimizes \(\|\hat \mu - \mu\|_2\).

However, we would ideally measure the quality of an estimate in an affine-invariant manner: if the problem instance (i.e., the estimate, the dataset, and the underlying distribution) is shifted and rescaled, then the error should remain unchanged. Affine invariance allows us to perform such transformations of our data and not artificially make the problem easier or harder. This property clearly isn’t satisfied by the \(\ell_2\)-norm: simply scaling the problem down would allow us to report an estimate with arbitrarily low error. In other words, the distance metric needs to be calibrated to the covariance \(\Sigma \in \mathbb{R}^{d \times d}\).

Instead, we consider error measured according to the Mahalanobis distance: output an estimate \(\hat \mu\) that minimizes \(\|\Sigma^{-1/2}(\hat \mu - \mu)\|_2\), where \(\Sigma\) is the (unknown) covariance of the underlying distribution. Note that, if the covariance matrix \(\Sigma = I\), then this reduces to the \(\ell_2\)-distance. Indeed, a valid interpretation of the Mahalanobis distance is to imagine rescaling the problem so that the covariance \(\Sigma\) is mapped to the identity matrix, and measuring \(\ell_2\)-distance after this transformation. A common way to think about Mahalanobis distance operationally is that it necessitates a more accurate estimate in directions with small variance, while permitting more error in directions with large variance.
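As a small illustration of the definition, the sketch below computes \(\|\Sigma^{-1/2}(\hat \mu - \mu)\|_2\) using only the standard library. It relies on the identity \(\|\Sigma^{-1/2}v\|_2^2 = v^\top\Sigma^{-1}v = \|L^{-1}v\|_2^2\), where \(\Sigma = LL^\top\) is a Cholesky factorization; this assumes \(\Sigma\) is symmetric positive definite. The function names are hypothetical.

```python
import math


def cholesky(sigma):
    """Lower-triangular L with L L^T = sigma (sigma symmetric positive definite)."""
    d = len(sigma)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L


def mahalanobis_error(mu_hat, mu, sigma):
    """Mahalanobis distance between mu_hat and mu with respect to sigma."""
    L = cholesky(sigma)
    diff = [a - b for a, b in zip(mu_hat, mu)]
    z = []  # solves L z = diff by forward substitution
    for i in range(len(diff)):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((diff[i] - s) / L[i][i])
    return math.sqrt(sum(v * v for v in z))


sigma = [[4.0, 0.0], [0.0, 1.0]]
# The same Euclidean error costs less along the high-variance direction.
print(mahalanobis_error([2.0, 0.0], [0.0, 0.0], sigma))
print(mahalanobis_error([0.0, 2.0], [0.0, 0.0], sigma))
```

Rescaling the whole problem instance (estimate, mean, and covariance together) leaves this error unchanged, which is the invariance property motivating the definition.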

OK, so how do we learn the mean of a Gaussian in Mahalanobis distance? In the non-private setting, the answer is simple: just take the empirical mean \(\hat \mu = \frac{1}{n} \sum_{i=1}^n X_i\)! It turns out that with \(O(d)\) samples, the empirical mean provides an accurate estimate (in Mahalanobis distance) of the true mean \(\mu\). Note that these guarantees hold regardless of the true covariance matrix \(\Sigma\).

It isn’t quite so easy when we want to do things privately. The most natural way would be to add noise to the empirical mean. However, we first have to “clip” the datapoints (i.e., rescale any points that are “too large”) in order to limit the sensitivity of this statistic. This is where the challenges arise: we would ideally like to clip the data based on the shape of the (unknown) covariance matrix \(\Sigma\) [KLSU19]. Deviating significantly from \(\Sigma\) would either introduce bias due to clipping too many points, or add excessive amounts of noise. Unfortunately, the covariance matrix \(\Sigma\) is unknown, and privately estimating it (in an appropriate metric) requires \(\Omega(d^{3/2})\) samples [KMS22]. This is substantially larger than the \(O(d)\) sample complexity of non-private Gaussian mean estimation. Furthermore, this covariance estimation step really is the bottleneck. Given a coarse estimate of \(\Sigma\), only \(O(d)\) additional samples are required to estimate the mean privately in Mahalanobis distance. This leads to the intriguing question: is it possible to privately estimate the mean of a Gaussian without explicitly estimating the covariance matrix?
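For reference, the basic clip-then-add-noise approach can be sketched as follows. This is not the algorithm of any of the cited papers; it is the simple baseline, under the assumptions that we clip to a fixed \(\ell_2\) ball of radius `radius` (i.e., we pretend the covariance is roughly spherical), that neighbouring datasets differ by replacing one point, and that privacy is measured in \(\rho\)-zCDP. The function names and parameters are hypothetical.

```python
import math
import random


def clip_l2(point, radius):
    """Rescale `point` so that its l2 norm is at most `radius`."""
    norm = math.sqrt(sum(v * v for v in point))
    if norm <= radius:
        return list(point)
    scale = radius / norm
    return [v * scale for v in point]


def noisy_clipped_mean(data, radius, rho, rng=random):
    """Clipped empirical mean plus spherical Gaussian noise.

    Replacing one clipped point moves the mean by at most 2 * radius / n in
    l2 norm, and Gaussian noise with standard deviation
    sensitivity / sqrt(2 * rho) per coordinate gives rho-zCDP."""
    n, d = len(data), len(data[0])
    clipped = [clip_l2(p, radius) for p in data]
    mean = [sum(p[j] for p in clipped) / n for j in range(d)]
    sensitivity = 2.0 * radius / n
    sigma = sensitivity / math.sqrt(2.0 * rho)
    return [m + rng.gauss(0.0, sigma) for m in mean]


data = [[0.5, -0.2], [3.0, 4.0], [-0.1, 0.3]]
print(noisy_clipped_mean(data, radius=1.0, rho=0.5))
```

The difficulty described above is visible in the `radius` parameter: a single spherical radius cannot match an elongated covariance, so it either clips too aggressively along the long directions or adds too much noise along the short ones.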

The answer is yes! A couple years back, Brown, Gaboardi, Smith, Ullman, and Zakynthinou [BGSUZ21] gave two different algorithms for private Gaussian mean estimation in Mahalanobis distance, which both require only \(O(d)\) samples. Interestingly, the two algorithms are quite different from each other. One simply adds noise to the empirical mean based on the empirical covariance matrix. The other one turns to a technique from robust statistics, sampling a point with large Tukey depth using the exponential mechanism. As described here, neither of these methods is differentially private yet – they additionally require a pre-processing step which checks if the dataset is sufficiently well-behaved, which happens with high probability when the data is generated according to a Gaussian distribution. The major drawback of both algorithms: they require exponential time to compute.

The two awarded papers [DHK23] and [BHS23] resolve this issue, giving the first computationally efficient \(O(d)\) sample algorithms for private mean estimation in Mahalanobis distance. Interestingly, the algorithms in both papers follow the same recipe as the first algorithm mentioned above: add noise to the empirical mean based on the empirical covariance matrix. The catch is that the empirical mean and covariance are replaced with stable estimates of the empirical mean and covariance, where stability bounds how much the estimators can change due to modification of individual datapoints. Importantly, these stable estimators are efficient to compute. Further details of these subroutines are beyond the scope of this post, but the final algorithm simply adds noise to the stably-estimated mean based on the stably-estimated covariance. Different extensions of these results are explored in the two papers, including estimation of covariance, and mean estimation in settings where the distribution may be heavy-tailed or rank-deficient.

Most of the algorithms described above are based on some notion of robustness, thus suggesting connections to the mature literature on robust statistics. These connections have been explored as far back as 2009, in foundational work by Dwork and Lei [DL09]. Over the last couple of years, there has been a flurry of renewed interest in links between robustness and privacy, including, e.g., [BKSW19, KSU20, KMV22, LKO22, HKM22, GH22, HKMN23, AKTVZ23, AUZ23], beyond those mentioned above. For example, some works [GH22, HKMN23, AUZ23] show that, under certain conditions, a robust estimator implies a private one, and vice versa. The two awarded papers expand this literature in a somewhat different direction – the type of stability property considered leads to algorithms which qualitatively differ from those considered prior. It will be interesting to see how private and robust estimation evolve together over the next several years.

Congratulations once more to the authors of both awarded papers on their excellent results!

Call for Papers - TPDP 2023 - Submission deadline July 7

Wed, 28 Jun 2023 00:01:00 +0000

The 9th Workshop on the Theory and Practice of Differential Privacy (TPDP 2023) will take place in Boston on September 27-28, 2023. This is the first year the workshop is a standalone event, and it is also moving from a one-day event to two days. The OpenDP community meeting is the following day (also in Boston).

The workshop is intended to bring together the DP research community to discuss new developments over the past year. The workshop is non-archival, so does not preclude publishing the work elsewhere.

The submission deadline is July 7. Submissions should be 4 pages (plus references and appendices).

Submission website: https://hcrp.cs.uchicago.edu

Open problem - Better privacy guarantees for larger groups

Mon, 26 Jun 2023 21:00:00 -0400

Consider a simple query counting the number of people in various mutually exclusive groups. In the differential privacy literature, it is typical to assume that each of these groups should be subject to the same privacy loss: the noise added to each count has the same magnitude, and everyone gets the same privacy guarantees. However, in settings where these groups have vastly different population sizes, larger populations may be willing to accept more error in exchange for stronger privacy protections. In particular, in many use cases, relative error (the noisy count is within 5% of the true value) matters more than absolute error (the noisy count is within 100 of the true value). This leads to a natural question: can we use this fact to develop a mechanism that improves the privacy guarantees of individuals in larger groups, subject to a constraint on relative error?

Problem definition

Our goal is to obtain a mechanism which minimizes the overall privacy loss for each group without exceeding a relative error threshold for each group. To formalize this goal, we first define a notion of per-group privacy we call group-wise zero-concentrated differential privacy as follows.

Definition. Group-wise zero-concentrated differential privacy. Assume possible datasets consist of records from domain \(U\), and \(U\) can be partitioned into \(k\) fixed, disjoint groups \(U_1\), …, \(U_k\). Let \(v : \mathcal{D} \rightarrow \mathbb{R}^k\) be a function associating a dataset to a vector of privacy budgets (one per group). We say a mechanism \(\mathcal{M}\) satisfies \(v\)-group-wise zero-concentrated differential privacy (zCDP) if for any two datasets \(D\), \(D’\) differing in the addition or removal of a record in \(U_i\), and for all \(\alpha>1\), we have: \[ D_\alpha\left(\mathcal{M}(D)||\mathcal{M}(D’)\right) \le \alpha \cdot {v(D)}_i \] \[ D_\alpha\left(\mathcal{M}(D’)||\mathcal{M}(D)\right) \le \alpha \cdot {v(D)}_i \] where \(D_\alpha\) is the Rényi divergence of order \(\alpha\).

This definition is similar to tailored DP, defined in [LP15]: each individual gets a different privacy guarantee, depending on which group they belong to; this guarantee also depends on how many people are in this group. We use zCDP as our definition of privacy due to its compatibility with the Gaussian mechanism; the same idea could easily be applied to other definitions like Rényi DP or pure DP.

From there we can give a more formal definition of the problem as follows. The goal is to minimize the privacy loss for each individual group, while keeping the error under a given threshold. For larger groups that can accept more noise, this means adding more noise to achieve the smallest possible privacy loss.

Problem. Let \(r \in (0,1]\) be an acceptable level of relative error, and \(k\) be the number of distinct, mutually-exclusive partitions of domain \(X\). Given a dataset \(D\), let \(x(D)\) be a vector containing the count of records in each partition. The objective is to find a mechanism \(\mathcal{M}\) which takes in \(r\), \(k\), and \(D\) and outputs \(\hat{x}(D)\) such that \(E\left[\left|{x(D)}_i-{\hat{x}(D)}_i\right|\right]<r\cdot {x(D)}_i\) for all \(i\), and satisfies \(v\)-group-wise zCDP where \(v(D)_i\) is as small as possible for all \(i\).
To prevent pathological mechanisms that optimize for specific datasets, we add two constraints to the problem: the privacy guarantee \(v(D)_i\) should only depend on \(x(D)_i\), and should be nonincreasing with \(x(D)_i\).

Since the relative error thresholds are proportional to the population size, each population can tolerate a different amount of noise. This means that to minimize the privacy loss for each group, the mechanism must add noise of different scales to each group. Of course, directly using \(x(D)_i\) to determine the scale of the noise for group \(i\) leads to a privacy loss which is data dependent, similarly to e.g. PATE [PAEGT17], and as such should be treated as a protected value.
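To see how much the tolerable noise varies between groups, the sketch below computes an idealised, non-private benchmark: the zCDP parameter that a plain Gaussian mechanism on a sensitivity-1 count would have if its noise scale were set directly from the true count. It uses two standard facts: \(E|N(0,s^2)| = s\sqrt{2/\pi}\), and Gaussian noise of variance \(s^2\) on a sensitivity-1 query gives \(\frac{1}{2s^2}\)-zCDP. Because the scale is read off the true count, this is exactly the data-dependent calibration cautioned against above; it is not a solution to the problem, and the function name is hypothetical.

```python
import math


def oracle_budget(count, r):
    """Non-private benchmark: the zCDP parameter of Gaussian noise whose
    expected absolute error equals r * count (sensitivity-1 count assumed).

    NOTE: the noise scale depends on the true count, so this calculation is
    only a yardstick, not a differentially private mechanism."""
    if count <= 0:
        raise ValueError("count must be positive")
    s = r * count * math.sqrt(math.pi / 2.0)  # scale with E|noise| = r * count
    return 1.0 / (2.0 * s * s)


for count in (100, 1_000, 10_000):
    print(count, oracle_budget(count, r=0.05))
```

In this benchmark the budget shrinks quadratically with the group size, which gives a sense of how much stronger the guarantee for large groups could be in the best case.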

An example mechanism

An example mechanism that seems like it could address this problem is as follows. First, perform the original counting query and add Gaussian noise to satisfy \(\rho\)-zCDP. Then, add additional Gaussian noise to each count, with a variance that depends on the noisy count itself — adding more noise to larger groups. This mechanism is outlined in Algorithm 1.

Algorithm 1. Adding data-dependent noise as a post-processing step.
Require: A dataset \(D\) where each data point belongs to one of \(k\) groups, a privacy parameter \(\rho\), and a relative error rate \(r\).

Let \(\sigma^2 = 1/(2\rho)\)
For \(i=1\) to \(k\) do
\(\qquad\) Let \(x_i\) be the number of people in \(D\) in group \(i\)
\(\qquad\) Sample \(X_i \sim \mathcal{N}(x_i, \sigma^2)\)
\(\qquad\) Sample \(Y_i \sim \mathcal{N}_{k}(X_i, (rX_i)^2)\)
end for
return \(Y_1,\dots,Y_k\)
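A minimal Python sketch of Algorithm 1 follows, for readers who want to experiment with it. It takes the vector of true counts directly rather than a dataset, and it uses \(|rX_i|\) as the standard deviation in the second sampling step so that the variance is \((rX_i)^2\) even when the noisy count is negative; both choices, and the function name, are assumptions made for illustration.

```python
import math
import random


def algorithm_1(counts, rho, r, rng=random):
    """Sketch of Algorithm 1: Gaussian mechanism followed by
    data-dependent Gaussian post-processing noise."""
    sigma = math.sqrt(1.0 / (2.0 * rho))  # sigma^2 = 1 / (2 rho)
    outputs = []
    for x_i in counts:
        X_i = rng.gauss(x_i, sigma)       # rho-zCDP noisy count
        Y_i = rng.gauss(X_i, abs(r * X_i))  # extra noise with variance (r X_i)^2
        outputs.append(Y_i)
    return outputs


rng = random.Random(2023)
# sigma^2 = 100 corresponds to rho = 1 / 200; a single group with count 10,000.
samples = [algorithm_1([10_000], rho=1 / 200, r=0.3, rng=rng)[0] for _ in range(20_000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(mean, std)
```

By the law of total variance, the output for a group with true count \(x_i\) has variance \(\sigma^2 + r^2(x_i^2+\sigma^2)\), so with these parameters the empirical standard deviation should come out close to \(3000\).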

Algorithm 1 achieves the goal of having approximately \(r\) relative error in each group: the total variance of the mechanism's error is roughly \(\sigma^2 + (rX)^2\), and \(X\) is a \(\rho\)-zCDP measurement of the true count \(x(D)\). This mechanism satisfies at least \(\rho\)-zCDP: line 4 is an invocation of the Gaussian mechanism with privacy parameter \(\rho\), and line 5 is a post-processing step and as such preserves the zCDP guarantee. We would like to show that this algorithm also satisfies a stronger group-wise zCDP guarantee.

This makes intuitive sense: line 5 adds additional Gaussian noise without using the private data directly. Since the noise scale in line 5 is proportional to the total count in line 4, we expect the privacy guarantee to be significantly stronger for large groups with more noise. Further, we can verify experimentally that when the data magnitude is large compared to the noise, the output distribution for each group is close to a Gaussian distribution.

The below figure illustrates this finding. We plot 1,000,000 sample outputs of Algorithm 1 (red) with parameters \(\sigma^2 = 100\) and \(r= 0.3\), and compare it to the best fit Gaussian distribution (black outline) with mean \(10,002.6\) and standard deviation of \(2995.1\).

With parameters such as these, the output of the mechanism looks and behaves like a Gaussian distribution, which should be ideal to characterize the zCDP guarantee. However, it is difficult to directly quantify this guarantee, due to the changing variance which is also a random variable. Likewise, if the true count is close to zero or if the first instance of noise is large compared to the true count, then the resulting distribution takes on a heavy skew and is no longer similar to a single Gaussian distribution. Such distributions with randomized variances have not, to the best of our knowledge, been considered much in the literature, and we do not know whether the mechanism’s output distribution follows some well-studied distribution.

The randomized variance also makes it difficult to bound the Rényi divergence of the distribution and characterize the zCDP guarantees directly. Current privacy amplification techniques are insufficient, as those techniques consider adding additional noise where the noise parameters are independent of the data itself.

Perhaps the most promising direction to understand more about such processes is the area of stochastic differential equations, where it is common to study noise with data-dependent variance. The Bessel process [Øks03] is an example of such a process, where the noise is dependent on the current value. This process captures the noise added as post-processing (Line 5), but not the initial noise-addition step (Line 4). Furthermore, to the best of our knowledge, the Bessel process and other value-dependent stochastic differential equations do not have closed-form solutions.

Goal

We see two possible paths forward to address the original question. One path would be to obtain an analysis of Algorithm 1 which shows non-trivial improved privacy guarantees for larger groups. We tried multiple approaches, but could not prove such a result.

An alternative path would be to develop a different algorithm, which achieves better privacy guarantees for larger groups while maintaining the error below the relative error threshold for all groups.

Composition Basics

Tue, 01 Nov 2022 11:45:00 -0400

Our data is subject to many different uses. Many entities will have access to our data and those entities will perform many different analyses that involve our data. The greatest risk to privacy is that an attacker will combine multiple pieces of information from the same or different sources and that the combination of these will reveal sensitive details about us. Thus we cannot study privacy leakage in a vacuum; it is important that we can reason about the accumulated privacy leakage over multiple independent analyses, which is known as composition. We have previously discussed why composition is so important for differential privacy.

This is the first in a series of posts on composition in which we will explain in more detail how composition analyses work.

Composition is quantitative. The differential privacy guarantee of the overall system will depend on the number of analyses and the privacy parameters that they each satisfy. The exact relationship between these quantities can be complex. There are various composition theorems that give bounds on the overall parameters in terms of the parameters of the parts of the system.

The simplest composition theorem is what is known as basic composition, which applies to pure \(\varepsilon\)-DP (although it can be extended to approximate \((\varepsilon,\delta)\)-DP):

Theorem (Basic Composition) Let \(M_1, M_2, \cdots, M_k : \mathcal{X}^n \to \mathcal{Y}\) be randomized algorithms. Suppose \(M_j\) is \(\varepsilon_j\)-DP for each \(j \in [k]\). Define \(M : \mathcal{X}^n \to \mathcal{Y}^k\) by \(M(x)=(M_1(x),M_2(x),\cdots,M_k(x))\), where each algorithm is run independently. Then \(M\) is \(\varepsilon\)-DP for \(\varepsilon = \sum_{j=1}^k \varepsilon_j\).

Proof. Fix an arbitrary pair of neighbouring datasets \(x,x’ \in \mathcal{X}^n\) and output \(y \in \mathcal{Y}^k\). To establish that \(M\) is \(\varepsilon\)-DP, we must show that \(e^{-\varepsilon} \le \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} \le e^\varepsilon\). By independence, we have \[\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} = \frac{\prod_{j=1}^k\mathbb{P}[M_j(x)=y_j]}{\prod_{j=1}^k\mathbb{P}[M_j(x’)=y_j]} = \prod_{j=1}^k \frac{\mathbb{P}[M_j(x)=y_j]}{\mathbb{P}[M_j(x’)=y_j]} \le \prod_{j=1}^k e^{\varepsilon_j} = e^{\sum_{j=1}^k \varepsilon_j} = e^\varepsilon,\] where the inequality follows from the fact that each \(M_j\) is \(\varepsilon_j\)-DP and, hence, \(e^{-\varepsilon_j} \le \frac{\mathbb{P}[M_j(x)=y_j]}{\mathbb{P}[M_j(x’)=y_j]} \le e^{\varepsilon_j}\). Similarly, \(\prod_{j=1}^k \frac{\mathbb{P}[M_j(x)=y_j]}{\mathbb{P}[M_j(x’)=y_j]} \ge \prod_{j=1}^k e^{-\varepsilon_j}\), which completes the proof. ∎

Basic composition is already a powerful result, despite its simple proof; it establishes the versatility of differential privacy and allows us to begin reasoning about complex systems in terms of their building blocks. For example, suppose we have \(k\) functions \(f_1, \cdots, f_k : \mathcal{X}^n \to \mathbb{R}\) each of sensitivity \(1\). For each \(j \in [k]\), we know that adding \(\mathsf{Laplace}(1/\varepsilon)\) noise to the value of \(f_j(x)\) satisfies \(\varepsilon\)-DP. Thus, if we add independent \(\mathsf{Laplace}(1/\varepsilon)\) noise to each value \(f_j(x)\) for all \(j \in [k]\), then basic composition tells us that releasing this vector of \(k\) noisy values satisfies \(k\varepsilon\)-DP. If we want the overall system to be \(\varepsilon\)-DP, then we should add independent \(\mathsf{Laplace}(k/\varepsilon)\) noise to each value \(f_j(x)\).
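The Laplace example can be sketched in a few lines of Python. The sketch assumes the \(k\) sensitivity-1 query answers have already been computed and are passed in as a list; it samples Laplace noise as the difference of two independent exponential random variables, which is a standard way to do so with the standard library. The function names are hypothetical.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(scale) as the difference of two independent
    exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def release_with_basic_composition(values, epsilon, rng=random):
    """Release k sensitivity-1 values with total budget epsilon by giving
    each value epsilon / k, i.e. Laplace(k / epsilon) noise per value."""
    k = len(values)
    scale = k / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


true_answers = [12.0, 40.0, 7.0]  # hypothetical values f_1(x), f_2(x), f_3(x)
print(release_with_basic_composition(true_answers, epsilon=1.0))
```

Each released value here is \(\varepsilon/k\)-DP on its own, and basic composition gives \(\varepsilon\)-DP for the whole vector.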

Is Basic Composition Optimal?

If we want to release \(k\) values each of sensitivity \(1\) (as above) and have the overall release be \(\varepsilon\)-DP, then, using basic composition, we can add \(\mathsf{Laplace}(k/\varepsilon)\) noise to each value. The variance of the noise for each value is \(2k^2/\varepsilon^2\), so the standard deviation is \(\sqrt{2} k /\varepsilon\). In other words, the scale of the noise must grow linearly with the number of values \(k\) if the overall privacy and each value’s sensitivity is fixed. It is natural to wonder whether the scale of the Laplace noise can be reduced by improving the basic composition result. We now show that this is not possible.

For each \(j \in [k]\), let \(M_j : \mathcal{X}^n \to \mathbb{R}\) be the algorithm that releases \(f_j(x)\) with \(\mathsf{Laplace}(k/\varepsilon)\) noise added. Let \(M : \mathcal{X}^n \to \mathbb{R}^k\) be the composition of these \(k\) algorithms. Then \(M_j\) is \(\varepsilon/k\)-DP for each \(j \in [k]\) and basic composition tells us that \(M\) is \(\varepsilon\)-DP. The question is whether \(M\) satisfies a better DP guarantee than this – i.e., does \(M\) satisfy \(\varepsilon_*\)-DP for some \(\varepsilon_*<\varepsilon\)? Suppose we have neighbouring datasets \(x,x’\in\mathcal{X}^n\) such that \(f_j(x) = f_j(x’)+1\) for each \(j \in [k]\). Let \(y=(a,a,\cdots,a) \in \mathbb{R}^k\) for some \(a \ge \max_{j=1}^k f_j(x)\). Then \[ \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} = \frac{\prod_{j=1}^k \mathbb{P}[f_j(x)+\mathsf{Laplace}(k/\varepsilon)=y_j]}{\prod_{j=1}^k \mathbb{P}[f_j(x’)+\mathsf{Laplace}(k/\varepsilon)=y_j]} \] \[ = \prod_{j=1}^k \frac{\frac{\varepsilon}{2k}\exp\left(-\frac{\varepsilon}{k} |y_j-f_j(x)| \right)}{\frac{\varepsilon}{2k}\exp\left(-\frac{\varepsilon}{k} |y_j-f_j(x’)| \right)} = \prod_{j=1}^k \frac{\exp\left(-\frac{\varepsilon}{k} (y_j-f_j(x)) \right)}{\exp\left(-\frac{\varepsilon}{k} (y_j-f_j(x’)) \right)} \] \[ = \prod_{j=1}^k \exp\left(\frac{\varepsilon}{k}\left(f_j(x)-f_j(x’)\right)\right) = \exp\left( \frac{\varepsilon}{k} \sum_{j=1}^k \left(f_j(x)-f_j(x’)\right)\right)= e^\varepsilon, \] where the third equality removes the absolute values because \(y_j \ge f_j(x)\) and \(y_j \ge f_j(x’)\). This shows that basic composition is optimal. For this example, we cannot prove a better guarantee than what is given by basic composition.
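The calculation above can be checked numerically. The sketch below evaluates the log-likelihood ratio \(\log\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]}\) for the Laplace example directly from the Laplace density; the specific values of \(f_j(x)\) and \(f_j(x’)\) are made up for illustration, and the function names are hypothetical.

```python
import math


def log_laplace_density(z, scale):
    """Log-density of Laplace(scale) at z."""
    return -math.log(2.0 * scale) - abs(z) / scale


def privacy_loss(y, fx, fx_prime, epsilon):
    """Log-likelihood ratio of output y under x versus x', when each of the
    k coordinates receives independent Laplace(k / epsilon) noise."""
    k = len(y)
    scale = k / epsilon
    return sum(
        log_laplace_density(yj - a, scale) - log_laplace_density(yj - b, scale)
        for yj, a, b in zip(y, fx, fx_prime)
    )


epsilon = 1.0
fx = [3.0, 1.0, 4.0, 1.0, 5.0]          # hypothetical f_j(x)
fx_prime = [v - 1.0 for v in fx]        # f_j(x') = f_j(x) - 1
a = max(fx)
print(privacy_loss([a] * len(fx), fx, fx_prime, epsilon))  # worst-case output
```

Choosing every \(y_j\) below \(\min_j f_j(x’)\) instead gives the opposite extreme, a log-likelihood ratio of \(-\varepsilon\).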

Is there some other way to improve upon basic composition that circumvents this example? Note that we assumed that there are neighbouring datasets \(x,x’\in\mathcal{X}^n\) such that \(f_j(x) = f_j(x’)+1\) for each \(j \in [k]\). In some settings, no such worst case datasets exist. In that case, instead of scaling the noise linearly with \(k\), we can scale the Laplace noise according to the \(\ell_1\) sensitivity \(\Delta_1 := \sup_{x,x’ \in \mathcal{X}^n \atop \text{neighbouring}} \sum_{j=1}^k |f_j(x)-f_j(x’)|\).

Instead of adding assumptions to the problem, we will look more closely at the example above. We showed that there exists some output \(y \in \mathbb{R}^k\) such that \(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} = e^\varepsilon\). However, such outputs \(y\) are very rare, as we require \(y_j \ge \max\{f_j(x),f_j(x’)\}\) for each \(j \in [k]\) where \(y_j = f_j(x) + \mathsf{Laplace}(k/\varepsilon)\). Thus, in order to observe an output \(y\) such that the likelihood ratio is maximal, all of the \(k\) Laplace noise samples must be positive, which happens with probability \(2^{-k}\). The fact that outputs \(y\) with maximal likelihood ratio are exceedingly rare turns out to be a general phenomenon and not specific to the example above.

Can we improve on basic composition if we only ask for a high probability bound? That is, instead of demanding \(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} \le e^{\varepsilon_*}\) for all \(y \in \mathcal{Y}\), we demand \(\mathbb{P}_{Y \gets M(x)}\left[\frac{\mathbb{P}[M(x)=Y]}{\mathbb{P}[M(x’)=Y]} \le e^{\varepsilon_*}\right] \ge 1-\delta\) for some \(0 < \delta \ll 1\). Can we prove a better bound \(\varepsilon_* < \varepsilon\) in this relaxed setting? The answer turns out to be yes.
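To get a feel for this, here is a rough Monte Carlo sketch of the log-likelihood ratio in the Laplace example, with \(Y \gets M(x)\) and \(f_j(x) = f_j(x’)+1\). The parameter values (\(k=20\), \(\varepsilon=1\), \(\delta=10^{-3}\), 20000 samples, a fixed seed) are assumptions chosen for illustration, and the empirical quantile is only an estimate, not a proof of any bound.

```python
import math
import random

def privacy_loss(k, eps, rng):
    """One sample of log(P[M(x)=Y] / P[M(x')=Y]) with Y drawn from M(x)."""
    scale = k / eps
    loss = 0.0
    for _ in range(k):
        # The difference of two i.i.d. exponentials is Laplace(scale) noise.
        z = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        # Y_j - f_j(x) = z and Y_j - f_j(x') = z + 1.
        loss += (abs(z + 1.0) - abs(z)) / scale
    return loss

rng = random.Random(0)               # fixed seed, for reproducibility only
k, eps, delta = 20, 1.0, 1e-3        # assumed illustrative values
losses = sorted(privacy_loss(k, eps, rng) for _ in range(20000))

# Empirical (1 - delta)-quantile of the privacy loss.
eps_star = losses[math.ceil((1 - delta) * len(losses)) - 1]
# Fraction of samples that attain the worst case (all noise samples positive).
frac_max = sum(l >= eps - 1e-9 for l in losses) / len(losses)

print("empirical quantile:", eps_star, "worst case:", eps)
print("fraction attaining the worst case:", frac_max)
```

The worst case \(\varepsilon\) is essentially never observed (it has probability \(2^{-k}\)), and the empirical quantile sits noticeably below \(\varepsilon\).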

The limitation of pure \(\varepsilon\)-DP is that events with tiny probability – which are negligible in real-world applications – can dominate the privacy analysis. This motivates us to move to relaxed notions of differential privacy, such as approximate \((\varepsilon,\delta)\)-DP and concentrated DP, which are less sensitive to low probability events.

Preview: Advanced Composition

By moving to approximate \((\varepsilon,\delta)\)-DP with \(\delta>0\), we can prove an asymptotically better composition theorem, which is known as the advanced composition theorem [DRV10].

Theorem (Advanced Composition Starting from Pure DP¹) Let \(M_1, M_2, \cdots, M_k : \mathcal{X}^n \to \mathcal{Y}\) be randomized algorithms. Suppose \(M_j\) is \(\varepsilon_j\)-DP for each \(j \in [k]\). Define \(M : \mathcal{X}^n \to \mathcal{Y}^k\) by \(M(x)=(M_1(x),M_2(x),\cdots,M_k(x))\), where each algorithm is run independently. Then \(M\) is \((\varepsilon,\delta)\)-DP for any \(\delta>0\) with \[\varepsilon = \frac12 \sum_{j=1}^k \varepsilon_j^2 + \sqrt{2\log(1/\delta) \sum_{j=1}^k \varepsilon_j^2}.\]

Recall that basic composition gives \(\delta=0\) and \(\varepsilon = \sum_{j=1}^k \varepsilon_j\). That is, basic composition scales with the 1-norm of the vector \((\varepsilon_1, \varepsilon_2, \cdots, \varepsilon_k)\), whereas advanced composition scales with the 2-norm of this vector (and the squared 2-norm). Neither bound strictly dominates the other. However, asymptotically (in a sense we will make precise in the next paragraph) advanced composition dominates basic composition.
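As a quick illustration of "neither bound strictly dominates the other", the two formulas can be evaluated side by side. This is a minimal sketch; the function names and the example parameter values are assumptions made for illustration.

```python
import math

def basic_composition(eps_list):
    """Pure-DP guarantee from basic composition: the 1-norm."""
    return sum(eps_list)

def advanced_composition(eps_list, delta):
    """The epsilon of the (epsilon, delta)-DP guarantee in the theorem above."""
    sq_norm = sum(e * e for e in eps_list)
    return 0.5 * sq_norm + math.sqrt(2.0 * math.log(1.0 / delta) * sq_norm)

delta = 1e-6

# Few, relatively large epsilons: basic composition wins.
few = [0.1] * 10
print(basic_composition(few), advanced_composition(few, delta))

# Many small epsilons: advanced composition wins.
many = [0.01] * 10000
print(basic_composition(many), advanced_composition(many, delta))
```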

Suppose we have a fixed \((\varepsilon,\delta)\)-DP guarantee for the entire system and we must answer \(k\) queries of sensitivity \(1\). Using basic composition, we can answer each query by adding \(\mathsf{Laplace}(k/\varepsilon)\) noise to each answer. However, using advanced composition, we can answer each query by adding \(\mathsf{Laplace}(\sqrt{k/2\rho})\) noise to each answer, where² \[\rho = \frac{\varepsilon^2}{4\log(1/\delta)+4\varepsilon}.\] If the privacy parameters \(\varepsilon,\delta>0\) are fixed (which implies \(\rho\) is fixed) and \(k \to \infty\), we can see that asymptotically advanced composition gives noise per query scaling as \(\Theta(\sqrt{k})\), while basic composition results in noise scaling as \(\Theta(k)\).
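The comparison of noise scales can be made concrete with a small sketch. It reads \(\sqrt{k/2\rho}\) as \(\sqrt{k/(2\rho)}\), in line with the footnote; the function names and the values \(\varepsilon=1\), \(\delta=10^{-6}\) are illustrative assumptions.

```python
import math

def rho_for(eps, delta):
    """rho = eps^2 / (4 log(1/delta) + 4 eps), as in the text."""
    return eps ** 2 / (4.0 * math.log(1.0 / delta) + 4.0 * eps)

def scale_basic(k, eps):
    """Laplace scale per query under basic composition."""
    return k / eps

def scale_advanced(k, eps, delta):
    """Laplace scale per query under advanced composition."""
    return math.sqrt(k / (2.0 * rho_for(eps, delta)))

def achieved_epsilon(k, eps, delta):
    """Plug eps_j = 1/scale back into the advanced composition formula."""
    eps_j = 1.0 / scale_advanced(k, eps, delta)
    sq_norm = k * eps_j ** 2
    return 0.5 * sq_norm + math.sqrt(2.0 * math.log(1.0 / delta) * sq_norm)

eps, delta = 1.0, 1e-6               # assumed overall privacy budget
for k in (10, 1000, 100000):
    print(k, scale_basic(k, eps), scale_advanced(k, eps, delta))
```

For small \(k\) the Laplace scale from basic composition is still the smaller one; the \(\sqrt{k}\) scaling only pays off once \(k\) is large relative to \(\log(1/\delta)\).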

In the next few posts we will explain how advanced composition works. We hope this conveys an intuitive understanding of composition and, in particular, how this \(\sqrt{k}\) asymptotic behaviour arises. If you want to read ahead, these posts are extracts from this book chapter.

This result generalizes to approximate DP. If instead we assume \(M_j\) is \((\varepsilon_j,\delta_j)\)-DP for each \(j \in [k]\), then the final composition is \((\varepsilon,\delta+\sum_{j=1}^k \delta_j)\)-DP with \(\varepsilon\) as before. ↩
Adding \(\mathsf{Laplace}(\sqrt{k/2\rho})\) noise to a sensitivity-1 query ensures \(\varepsilon_j\)-DP for \(\varepsilon_j = \sqrt{2\rho/k}\). Hence \(\sum_{j=1}^k \varepsilon_j^2 = 2\rho\). Setting \(\rho = \frac{\varepsilon^2}{4\log(1/\delta)+4\varepsilon}\) ensures that \(\frac12 \sum_{j=1}^k \varepsilon_j^2 + \sqrt{2\log(1/\delta) \sum_{j=1}^k \varepsilon_j^2} = \rho + \sqrt{4\rho\log(1/\delta)} \le \varepsilon\). ↩

Privacy Doona: Why We Should Hide Among The Clones

Tue, 24 May 2022 11:45:00 -0400

In this blog post, we will discuss a recent(ish) result of Feldman, McMillan, and Talwar [FMT21], which provides an improved and simple analysis of the so-called “amplification by shuffling” formally connecting local privacy (LDP) and shuffle privacy.¹ Now, I’ll assume the reader is familiar with both LDP and Shuffle DP: if not, a quick-and-dirty refresher (with less quick, and less dirty references) can be found here, and of course there is also Albert Cheu’s excellent survey on Shuffle DP [Cheu21].

I will also ignore most of the historical details, but it is worth mentioning that [FMT21] is not the first paper on this “amplification by shuffling,” (which, for local reasons, I’ll just call a privacy doona) but rather is the culmination of a rather long line of work involving many cool ideas and papers, starting with [CSUZZ19, EFMRTT19]: I’d refer the reader to Table 1 in [FMT21] for an overview.

Alright, now that the caveats are behind us, what is “amplification by shuffling”? In a nutshell, it is capturing the (false!) intuition that “anonymization provides privacy” (which, again, is false! Don’t do this!) and making it… less false. The idea is that while anonymization does not in itself provide any meaningful privacy guarantee, it can amplify an existing, rigorous privacy guarantee. So if I start with a somewhat lousy LDP guarantee, but then all the messages sent by all users are completely anonymized, then my lousy LDP guarantee suddenly gets much stronger (roughly speaking, the \(\varepsilon\) parameter goes down with the square root of the number of users involved). Which is wonderful! Let’s see what this means, quantitatively.

The result of Feldman, McMillan, and Talwar.

Here, we will focus on the simpler case of noninteractive protocols (one-shot messages from the users to the central server, no funny business with messages going back and forth); which is conceptually simpler to state and parse, still very rich and interesting, and, well, very relevant in practice (being the easiest and cheapest to deploy). If you want the results in their full glorious generality, though, they are in the paper.

What the main theorem of [FMT21] is saying for this noninteractive setting can then be stated as follows: if I have an \(\varepsilon_L\)-locally private (LDP) protocol for a task, where all \(n\) users pass their data through the same randomizer (algorithm) \(R\) and send the resulting message \(y_i \gets R(x_i)\), then just permuting the messages \(y_1,\dots,y_n\) immediately gives an \((\varepsilon,\delta)\)-shuffle private protocol for the same task, for any pair \((\varepsilon,\delta)\) which satisfies \begin{equation} \varepsilon \leq \log\left( 1+ 16\frac{e^{\varepsilon_{L}}-1}{e^{\varepsilon_{L}}+1}\sqrt{\frac{e^{\varepsilon_{L}}\log\frac{4}{\delta}}{n}}\right) \tag{1} \end{equation} as long as \(n \gg e^{\varepsilon_{L}}\log(1/\delta)\). That is quite a lot to parse, though: what does this actually mean?
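Before parsing it, it may help to simply evaluate the right-hand side of (1). The sketch below does only that; the function name `shuffle_epsilon` is made up for this post, it does not check the condition \(n \gg e^{\varepsilon_{L}}\log(1/\delta)\), and it is not the authors' code.

```python
import math

def shuffle_epsilon(eps_local, n, delta):
    """Right-hand side of (1); meaningful only when n >> e^eps_local * log(1/delta)."""
    if eps_local <= 0 or n <= 0 or not (0 < delta < 1):
        raise ValueError("need eps_local > 0, n > 0 and 0 < delta < 1")
    e = math.exp(eps_local)
    amplification = 16.0 * (e - 1.0) / (e + 1.0)
    return math.log1p(amplification * math.sqrt(e * math.log(4.0 / delta) / n))

delta = 1e-6                          # assumed value for illustration
for n in (10**4, 10**6, 10**8):
    print(n, shuffle_epsilon(1.0, n, delta))
```

With \(\varepsilon_L = 1\) and \(\delta = 10^{-6}\), the value shrinks as \(n\) grows, which is the amplification effect discussed next.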

First, the assumption that all users have the same randomizer (or at least cannot be distinguished by their randomizer) is quite natural: if they didn’t, then we wouldn’t be able to say anything in general, since the randomizer they use could just give away their identity completely. For instance, as an extreme case, the randomizer of user \(i\) could just append \(i\) to the message (it’s OK, still LDP!), and then shuffling achieves exactly nothing: we know who sent what. So OK, asking for all randomizers to be the same is not really a restriction.

Second, each user only sends one message, and this preserves its length (we just shuffled the messages, didn’t modify them!). So if you start with an LDP protocol with amazing features XYZ (e.g., the messages are \(1\)-bit long, or users don’t share a random seed, or the randomizers run in time \(O(1)\)), then the shuffle protocol enjoys exactly the same properties. (It also naturally enjoys some robustness, in the sense that if \(10\%\) of the \(n\) users maliciously deviate from the protocol, they can’t really jeopardize the privacy of the remaining \(90\%\) of users.² Which is… good.)
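To make "just shuffle the messages" concrete, here is a toy sketch using binary randomized response as the local randomizer \(R\). This particular randomizer is a standard example chosen for illustration, not something prescribed by [FMT21], and all function names are hypothetical.

```python
import math
import random

def randomized_response(bit, eps_local, rng):
    """eps_local-LDP randomizer: report the true bit w.p. e^eps / (1 + e^eps)."""
    p_truth = math.exp(eps_local) / (1.0 + math.exp(eps_local))
    return bit if rng.random() < p_truth else 1 - bit

def shuffled_reports(bits, eps_local, rng):
    """Every user applies the same randomizer; then the 1-bit messages are permuted."""
    reports = [randomized_response(b, eps_local, rng) for b in bits]
    rng.shuffle(reports)              # the shuffler: messages unchanged, order hidden
    return reports

def estimate_mean(reports, eps_local):
    """Debiased estimate of the fraction of ones among the users' true bits."""
    p_truth = math.exp(eps_local) / (1.0 + math.exp(eps_local))
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)

rng = random.Random(1)                # fixed seed, for reproducibility only
bits = [1] * 6000 + [0] * 14000       # assumed toy data: 30% ones
reports = shuffled_reports(bits, 1.0, rng)
print(estimate_mean(reports, 1.0))
```

The server-side estimator is exactly the one it would use in the local model: shuffling changes neither the messages nor their length, only the privacy analysis.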

Third, this is inherently approximate DP. Here we started with pure LDP (you can also extend that to approximate LDP) and ended up with approximate Shuffle DP: this is not a mistake, that’s how it is. I am not a purist (erm) myself, and that looks more than good enough to me; but if you seek pure Shuffle DP, then this result is not the droid you’re looking for.

Alright, what is this guarantee stated in (1) giving us? Let’s interpret the expression in (1) in two parameter regimes, focusing on \(\varepsilon\) (fixing some small \(\delta>0\)). If we start with \(\varepsilon_{L} \ll 1\) for our LDP randomizers \(R\), then a first-order Taylor expansion shows that we get \[ \varepsilon \approx \varepsilon_{L}\cdot 8\sqrt{\frac{\log\frac{4}{\delta}}{n}} \] so that shuffling improved our privacy parameter by a factor \(\sqrt{n}\).³ 😲 This is great! With more users, comes more privacy!
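As a sanity check on that Taylor expansion, one can compare the right-hand side of (1) with the first-order approximation numerically. This is a rough sketch with assumed values \(n = 10^6\) and \(\delta = 10^{-6}\); the function names are illustrative.

```python
import math

def exact_bound(eps_local, n, delta):
    """Right-hand side of (1)."""
    e = math.exp(eps_local)
    return math.log1p(16.0 * (e - 1.0) / (e + 1.0)
                      * math.sqrt(e * math.log(4.0 / delta) / n))

def small_eps_approx(eps_local, n, delta):
    """First-order approximation, intended for eps_local << 1."""
    return 8.0 * eps_local * math.sqrt(math.log(4.0 / delta) / n)

def relative_error(eps_local, n, delta):
    exact = exact_bound(eps_local, n, delta)
    return abs(exact - small_eps_approx(eps_local, n, delta)) / exact

n, delta = 10**6, 1e-6                # assumed values for illustration
for eps_local in (0.01, 0.1, 1.0):
    print(eps_local, exact_bound(eps_local, n, delta),
          small_eps_approx(eps_local, n, delta))
```

The approximation is tight for small \(\varepsilon_L\) and degrades as \(\varepsilon_L\) approaches \(1\), as one would expect from a first-order expansion.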

But that was starting with small \(\varepsilon_{L}\), that is, already pretty good privacy guarantees for our LDP “building block” \(R\). What happens if we start with “somewhat lousy” privacy guarantees, that is, \(\varepsilon_{L} \gg 1\)? Do we get anything interesting then? Another Taylor expansion (everything is a Taylor expansion) shows us that, then, \[ \begin{equation} \varepsilon \approx \log\left( 1+ 16\sqrt{\frac{e^{\varepsilon_{L}}\log\frac{4}{\delta}}{n}}\right) \tag{2} \end{equation} \] or, put differently, \[ \begin{equation} \varepsilon \approx 16e^{\varepsilon_{L}/2}\sqrt{\frac{\log\frac{4}{\delta}}{n}} \tag{3} \end{equation} \] That’s a bit harder to interpret, but that seems… useful? It is: let us see how much, with a couple examples.

Learning.

The first one is distribution learning, a.k.a. density estimation: you have \(n\) i.i.d. samples (one per user) from an unknown probability distribution \(\mathbf{p}\) over a discrete domain of size \(k\), and your goal is to output an estimate \(\widehat{\mathbf{p}}\) such that, with high (say, constant) probability, \(\mathbf{p}\) and \(\widehat{\mathbf{p}}\) are close in total variation distance: \[ \operatorname{TV}(\mathbf{p},\widehat{\mathbf{p}}) = \sup_{S\subseteq [k]} (\mathbf{p}(S) - \widehat{\mathbf{p}}(S) ) \leq \alpha \] (if total variation distance seems a bit mysterious, it’s exactly half the \(\ell_1\) distance between the probability mass functions). We know how to solve this problem in the non-private setting: \(n=\Theta\left( \frac{k}{\alpha^2} \right)\) samples are necessary and sufficient. We know how to solve this problem in the (central) DP setting: \(n=\Theta\left( \frac{k}{\alpha^2} + \frac{k}{\alpha\varepsilon} \right)\) samples are necessary and sufficient [DHS15]. We know how to solve this problem in the LDP setting: \begin{equation} n=\Theta\left(\frac{k^2}{\alpha^2(e^\varepsilon-1)^2}+\frac{k^2}{\alpha^2e^\varepsilon}+\frac{k}{\alpha^2}\right) \tag{4} \end{equation} samples are necessary and sufficient [ASZ19] (note that the first term is just \(k^2/(\alpha^2\varepsilon^2)\) for small \(\varepsilon\)). Now, as they say in Mulan: let’s make a shuffle DP algo out of you.

If we want to achieve \((\varepsilon,\delta)\)-shuffle DP, we need to select \(\varepsilon_L\). Based on (2) and (3), and ignoring pesky constants, we will choose it so that \begin{equation} \varepsilon_{L} \approx \varepsilon \sqrt{\frac{n}{\log(1/\delta)}} \quad\text{ or }\quad e^{\varepsilon_{L}} \approx \varepsilon^2 \cdot \frac{n}{\log(1/\delta)} \tag{5} \end{equation} depending on whether \(\frac{\varepsilon^2 n}{\log(1/\delta)}\geq 1\). Plugging that back in (4), we see that the first case corresponds to the first term (small \(\varepsilon_{L}\)) and the second to the second term (\(\varepsilon_{L} \geq 1\)), and overall the condition on \(n\) for the original LDP algorithm to successfully learn the distribution becomes \[ n \gtrsim \frac{k^2}{\alpha^2(e^{\varepsilon_{L}}-1)^2}+\frac{k^2}{\alpha^2e^{\varepsilon_{L}}}+\frac{k}{\alpha^2} \approx \frac{k^2\log(1/\delta)}{\alpha^2\varepsilon^2 n}+\frac{k^2\log(1/\delta)}{\alpha^2\varepsilon^2 n}+\frac{k}{\alpha^2} \approx \frac{k^2\log(1/\delta)}{\alpha^2\varepsilon^2 n}+\frac{k}{\alpha^2} \] (where \(\gtrsim\) means “let’s ignore constants”). There is an \(n\) in the RHS as well, so reorganizing and handling the two terms separately, the condition on \(n\) becomes \[ n \gtrsim \frac{k \sqrt{\log(1/\delta)}}{\alpha\varepsilon}+\frac{k}{\alpha^2} \] which… is great? We immediately get a sample complexity \(O\left(\frac{k}{\alpha^2}+\frac{k \sqrt{\log(1/\delta)}}{\alpha\varepsilon}\right)\) in the shuffle DP model, which (ignoring the \(\sqrt{\log(1/\delta)}\)) matches the one in the central DP setting!
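The "reorganizing" step is the only place where a little care is needed: a condition of the form \(n \ge A/n + B\) is implied by \(n \ge 2\max\{\sqrt{A}, B\}\), since then \(A/n \le \sqrt{A}/2 \le n/4\) and \(B \le n/2\). Here is a small sketch of that step with all hidden constants set to \(1\) (an assumption made purely for illustration, so the numbers are not actual sample sizes); the function names are hypothetical.

```python
import math

def solve_self_referential(a, b):
    """Return an n with n >= a / n + b, namely n = 2 * max(sqrt(a), b)."""
    return 2.0 * max(math.sqrt(a), b)

def shuffle_learning_samples(k, alpha, eps, delta):
    """Sufficient n for the learning condition above, all constants set to 1."""
    a = k ** 2 * math.log(1.0 / delta) / (alpha ** 2 * eps ** 2)
    b = k / alpha ** 2
    return solve_self_referential(a, b)

# Assumed illustrative parameters.
k, alpha, delta = 100, 0.1, 1e-6
for eps in (1.0, 0.01):
    print(eps, shuffle_learning_samples(k, alpha, eps, delta))
```

For \(\varepsilon = 1\) the non-private term \(k/\alpha^2\) dominates, while for \(\varepsilon = 0.01\) the privacy term \(k\sqrt{\log(1/\delta)}/(\alpha\varepsilon)\) takes over.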

tl;dr: Taking an optimal LDP algorithm and just shuffling the messages immediately gives an optimal shuffle DP algorithm, no extra work needed.

(Uniformity) Testing.

Alright, maybe it was a fluke? Let’s look at another “basic” problem close to my heart: we don’t want to learn the probability distribution \(\mathbf{p}\), just test whether it is actually the uniform distribution⁴ \(\mathbf{u}\) on the domain \([k]=\{1,2,\dots,k\}\). So if \(\mathbf{p} =\mathbf{u}\), you’ve got to say “yes” with probability at least \(2/3\), and if \(\operatorname{TV}(\mathbf{p},\mathbf{u})>\alpha\), then you need to say “no” with probability at least \(2/3\).

This is also well understood in the non-private setting (\(n=\Theta(\sqrt{k}/\alpha^2)\)) [Paninski08] [see also my upcoming survey], in the central DP setting (\(n=\Theta\left( \frac{\sqrt{k}}{\alpha^2} + \frac{\sqrt{k}}{\alpha\sqrt{\varepsilon}}+\frac{k^{1/3}}{\alpha^{4/3}\varepsilon^{2/3}} + \frac{1}{\alpha\varepsilon} \right)\)) [ASZ18, ADR18], and in the LDP setting, where the result differs on whether the users can communicate or share a common random seed \begin{equation} n=\Theta\left( \frac{k}{\alpha^2(e^\varepsilon-1)^2} + \frac{k}{\alpha^2e^{\varepsilon/2}} + \frac{\sqrt{k}}{\alpha^2}\right) \tag{6} \end{equation} or not \begin{equation} n=\Theta\left( \frac{k^{3/2}}{\alpha^2(e^\varepsilon-1)^2} + \frac{k^{3/2}}{\alpha^2e^{\varepsilon}} + \frac{\sqrt{k}}{\alpha^2}\right) \tag{7} \end{equation} as established in a sequence of papers [ACT19, AJM20, ACFST21, ACLST22, CL22].

Now, say you want an \((\varepsilon,\delta)\)-shuffle DP algorithm for uniformity testing, but don’t want to design one from scratch (though it is possible to do so, and some did [BCJM21, CL22, CY21]). Let’s say you want to look at the “no-common-random-seed-shared-by-users” model (a.k.a. private-coin setting): so you stare at the corresponding LDP sample complexity, (7), and try to choose \(\varepsilon_L\) to start with before shuffling. This will be the same as in the learning example (i.e., (5)): based on (2) and (3), we will set \begin{equation} \varepsilon_{L} \approx \varepsilon \sqrt{\frac{n}{\log(1/\delta)}} \quad\text{ or }\quad e^{\varepsilon_{L}} \approx \varepsilon^2 \cdot \frac{n}{\log(1/\delta)} \end{equation} depending on whether \(\frac{\varepsilon^2 n}{\log(1/\delta)}\geq 1\). Plugging this back in (7) and quickly checking which case corresponds to each term, we then easily get that for our algorithm to correctly solve the uniformity testing problem, it suffices that the sample complexity (number of users) \(n\) satisfies \[ n \gtrsim \frac{k^{3/2}}{\alpha^2(e^{\varepsilon_L}-1)^2} + \frac{k^{3/2}}{\alpha^2e^{\varepsilon_L}} + \frac{\sqrt{k}}{\alpha^2} \approx \frac{k^{3/2}\log(1/\delta)}{\alpha^2\varepsilon^2 n } + \frac{\sqrt{k}}{\alpha^2} \] which, reorganizing and solving for \(n\), means that it suffices to have \[ n \gtrsim \frac{k^{3/4}\sqrt{\log(1/\delta)}}{\alpha\varepsilon} + \frac{\sqrt{k}}{\alpha^2}\,. \] And, voilà! Even better, we also have strong evidence to suspect that this sample complexity \(O\Big(\frac{k^{3/4}\sqrt{\log(1/\delta)}}{\alpha\varepsilon}+ \frac{\sqrt{k}}{\alpha^2}\Big)\) is tight among all private-coin algorithms.⁵

Now, if you wanted to look at public-coin shuffle DP protocols (with a common random seed available), then you would start with an optimal public-coin LDP algorithm (and look at (6)), and setting \(\varepsilon_L\) the same way you’d get a shuffle DP algorithm with sample complexity \[ n=O\Big(\frac{k^{2/3}\log^{1/3}(1/\delta)}{\alpha^{4/3}\varepsilon^{2/3}} + \frac{\sqrt{k\log(1/\delta)}}{\alpha\varepsilon}+ \frac{\sqrt{k}}{\alpha^2}\Big) \] which, well, is also strongly believed to be optimal!
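For readers who want the intermediate step, here is a sketch of where the first two terms come from, under the same constant-free choice of \(\varepsilon_L\) as in (5). When \(\varepsilon_L\) is small, \((e^{\varepsilon_L}-1)^2 \approx \varepsilon_L^2 \approx \frac{\varepsilon^2 n}{\log(1/\delta)}\), so the first term of (6) gives \[ n \gtrsim \frac{k}{\alpha^2(e^{\varepsilon_L}-1)^2} \approx \frac{k\log(1/\delta)}{\alpha^2\varepsilon^2 n} \quad\Longleftrightarrow\quad n \gtrsim \frac{\sqrt{k\log(1/\delta)}}{\alpha\varepsilon}\,. \] When \(\varepsilon_L \geq 1\), we have \(e^{\varepsilon_L/2} \approx \varepsilon\sqrt{\frac{n}{\log(1/\delta)}}\), so the second term of (6) gives \[ n \gtrsim \frac{k}{\alpha^2 e^{\varepsilon_L/2}} \approx \frac{k\sqrt{\log(1/\delta)}}{\alpha^2\varepsilon\sqrt{n}} \quad\Longleftrightarrow\quad n^{3/2} \gtrsim \frac{k\sqrt{\log(1/\delta)}}{\alpha^2\varepsilon} \quad\Longleftrightarrow\quad n \gtrsim \frac{k^{2/3}\log^{1/3}(1/\delta)}{\alpha^{4/3}\varepsilon^{2/3}}\,. \] The last term of (6), \(\frac{\sqrt{k}}{\alpha^2}\), does not involve \(\varepsilon_L\) and carries over unchanged.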

tl;dr: Here again, taking an optimal off-the-shelf LDP algorithm and just shuffling the messages immediately gives an optimal shuffle DP algorithm, no extra work needed.

Conclusion.

I hope the above convinced you of how useful this privacy amplification can be: from an optimal LDP algorithm, featuring any extra appealing characteristics you like, just adding an extra shuffling step as postprocessing yields an (often optimal? At least good) shuffle DP algorithm, with the same characteristics and built-in robustness against malicious users.

All you need is to make sure that your starting point, the LDP algorithm, satisfies a couple of things: (1) all users have the same randomizer,⁶ and (2) it works in all regimes of \(\varepsilon\) (both high-privacy, \(\varepsilon \leq 1\), and low-privacy, \(\varepsilon \gg 1\)). Once you’ve got this, Bob’s your uncle! You get shuffle DP algorithms for free.

It is not only appealing from a theoretical point of view, by the way! The authors of the paper worked hard to make their empirical analysis compelling as well, and their code is available on GitHub 📝. But more importantly, from a practitioner’s point of view, this means it is enough to design, implement, and test one algorithm (the LDP one we start with) to automatically get a trusted one in the shuffle DP model as well: this reduces the risks of bugs, security failures, the amount of work spent on tuning, testing…

So yes, whenever possible, we should hide among the clones!

The title of this post is a reference to the title of [FMT21], “Hiding Among The Clones,” and to the notion of privacy blanket introduced by Balle, Bell, Gascón, and Nissim [BBGN19]. Intuitively, the “amplification by shuffling” paradigm can be seen as anonymizing the messages from local randomizers, whose message distribution can be mathematically decomposed as a mixture of “noise distribution not depending on the user’s input” and “distribution actually depending on their input.” As a result, each user randomly sends a message from the first or second distribution of the mixture. But the shuffling then hides the informative messages (drawn from the second part of the mixture) among the non-informative (noise) ones: so the noise messages end up providing a “privacy blanket” in which sensitive information is safely and soundly wrapped. ↩
More specifically, they can completely jeopardize the utility (accuracy) of the result, but in terms of privacy, all they can do is slightly reduce it: if \(10\%\) of users are malicious, the remaining \(90\%\) still get the privacy amplification guarantee of (1), but with \(0.9n\) instead of \(n\). ↩
Of course, we started with a local privacy guarantee, and ended up with a shuffle privacy guarantee: so the two are incomparable, and one has to interpret this “amplification” in that context. ↩
You can here replace uniform by any known distribution \(\mathbf{q}\) of your choosing, that doesn’t change the question (and result), but uniform is nice. ↩
As long as one is happy with approximate DP. One can achieve that in pure DP as well, but it’s a bit more complicated [CY21]. ↩
This is not such a big assumption usually, and there are somewhat-general ways to get to that using a logarithmic factor in the number of users. ↩