Differentially Private Algorithms that Never Fail

[Side Note: We’ve migrated email subscriptions from feedburner (RIP) to follow.it, so they should start working again.]

Most differentially private algorithms fail with some nonzero probability. For example, when adding Gaussian or Laplace noise, there is some chance that the noise deviates significantly from its mean. But, fortunately, large deviations are unlikely. In this post we’re going to take a closer look at failure modes of DP algorithms and we’ll present some generic methods for reducing – or even eliminating – the failure probability.

Let’s be precise about what we mean by failure probability: Let’s assume we have an \((\varepsilon,\delta)\)-differentially private algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) and we have a loss function \(\ell : \mathcal{Y} \times \mathcal{X}^n \to \mathbb{R}\).[1] The (worst-case)[2] failure probability \(\beta\) of \(M\) is \[\beta := \sup_{x\in\mathcal{X}^n} \mathbb{P}[\ell(M(x),x)>\alpha],\tag{1}\] where \(\alpha\) is some target value for the loss.[3]

For example, if \(M(x)=q(x)+\mathsf{Laplace}(1/\varepsilon)\) is the Laplace mechanism and \(\ell(y,x)=|y-q(x)|\) is the absolute error, then the failure probability is the tail probability \(\beta = \exp(-\varepsilon\alpha)\). If we want to eliminate the failure probability, we could use truncated Laplace noise instead of regular Laplace noise.[4] And – spoiler alert – that’s the kind of method we’re going to look at.
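To make this concrete, here’s a minimal Python sketch (the names and parameter values are ours, not from any DP library) comparing the tail of regular Laplace noise to its truncated counterpart:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, alpha = 1.0, 5.0

# Tail of the Laplace mechanism's noise: P[|noise| > alpha] = exp(-eps * alpha).
noise = rng.laplace(0.0, 1.0 / eps, size=1_000_000)
print(np.mean(np.abs(noise) > alpha))  # empirically ~ exp(-5) ~ 0.0067

def truncated_laplace(scale, trunc):
    # Laplace(scale) conditioned on magnitude <= trunc: the noise can never
    # exceed trunc, so the failure probability is exactly zero
    # (at the price of a nonzero delta in the privacy guarantee; see footnote 4).
    while True:
        z = rng.laplace(0.0, scale)
        if abs(z) <= trunc:
            return z
```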

To be clear, in this post we’re talking about failures of utility, which are different from failures of privacy. In a previous post, we talked about privacy failures; roughly, the \(\delta\) in \((\varepsilon,\delta)\)-DP captures the probability of a privacy failure. Privacy failures are a lot harder to fix than utility failures (which is kinda the point of this post).

Here’s our problem: We’re given a DP algorithm \(M\) with failure probability \(\beta\), and we want to modify the algorithm to get a new DP algorithm \(\widetilde{M}\) with failure probability \(\widetilde{\beta}<\beta\). Ideally, we want \(\widetilde{\beta}=0\).

Warmup: Absorbing the failure probability into \(\delta\)

Let’s start with a simple trick to get zero failure probability. This trick should hopefully give you some intuition for why it’s even possible to have zero failure probability under DP.

Suppose that, in addition to the \((\varepsilon,\delta)\)-DP algorithm \(M\) with failure probability \(\beta=\sup_x\mathbb{P}[\ell(M(x),x)>\alpha]\), we have a non-private algorithm \(\check{M} : \mathcal{X}^n \to \mathcal{Y}\) that never fails – i.e., \(\sup_x \mathbb{P}[\ell(\check{M}(x),x)>\alpha]=0\).[5]

Now let’s define \(\widetilde{M}(x)\) as follows. First, compute \(y=M(x)\). If \(\ell(y,x)\le\alpha\), return \(y\). If \(\ell(y,x)>\alpha\), compute \(\check{y}=\check{M}(x)\) and return \(\check{y}\).[1]
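In code, this is just a wrapper; here’s a minimal sketch, where `M`, `M_check`, and `loss` are hypothetical callables standing in for \(M\), \(\check{M}\), and \(\ell\):

```python
def absorb_failures(x, M, M_check, loss, alpha):
    # Run the (eps, delta)-DP algorithm and check its loss non-privately.
    y = M(x)
    if loss(y, x) <= alpha:
        return y
    # On failure, fall back to the non-private algorithm that never fails,
    # e.g. M_check(x) = argmin_y loss(y, x) (see footnote 5).
    return M_check(x)
```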

Clearly \(\widetilde{M}\) now has zero failure probability. What about privacy?

Fix arbitrary neighbouring \(x,x'\in\mathcal{X}^n\) and a measurable \(S\subset\mathcal{Y}\). Define \(S^* := \{ y \in S : \ell(y,x)\le\alpha \}\). Now we have \[\begin{aligned}\mathbb{P}[\widetilde{M}(x)\in S] &= \mathbb{P}[M(x)\in S^*] + \mathbb{P}[\ell(M(x),x)>\alpha] \cdot \mathbb{P}[\check{M}(x)\in S]\\&\le e^\varepsilon \mathbb{P}[M(x')\in S^*] + \delta + \mathbb{P}[\ell(M(x),x)>\alpha] \cdot 1\\&\le e^\varepsilon \mathbb{P}[\widetilde{M}(x')\in S] + \delta + \beta.\end{aligned}\tag{2}\] Thus \(\widetilde{M}\) is \((\varepsilon,\delta+\beta)\)-DP. In other words, we’ve absorbed the utility failure probability \(\beta\) into the privacy failure probability \(\delta\).[2]

This trick is neat since it lets us eliminate one of the parameters (\(\beta\)), but, in practice, you might not want to do this. We’re swapping a utility failure for a privacy failure and that often isn’t a great trade.

This trick only works if you already have a small failure probability \(\beta\).[6] What if we start with a large failure probability, say, \(\beta=0.1\) or even \(\beta=0.9\)? We can amplify the probability of getting a successful result by running the algorithm multiple times. Naïvely, the privacy cost increases according to composition; plus we need to select one of the runs to output, which requires looking at the input. This is roughly what we will do next, but we will avoid composition (sort of).

Avoiding silent failures

Above, we non-privately checked the failure condition \(\ell(M(x),x)>\alpha\). Intuitively, using a non-private test must cost us a lot in terms of privacy. Thus, to do better, we have to rely on a private test of the failure condition.

We can’t do much with an arbitrary loss function, so we need to make some assumptions. First, we will assume the loss has sensitivity \(\le1\). Second, we will assume that there is some wiggle room in the loss threshold \(\alpha\). Specifically, while the original algorithm \(M\) guarantees loss \(\le\alpha\) with probability \(\ge1-\beta\), our modified algorithm will guarantee loss \(\le\widetilde{\alpha}:=\alpha+2\tau\), where \(\tau=O(\log(1/\delta)/\varepsilon)\).

Are these assumptions reasonable? First, if the loss is high-sensitivity, then we can apply tricks like inverse sensitivity to get a low-sensitivity loss. Second, we can contrast with the exponential mechanism, which guarantees loss \(\le\mathsf{OPT}+O(\log|\mathcal{Y}|/\varepsilon)\). Thus the wiggle room we’re asking for is comparable to (or better than) what we’d get from the exponential mechanism, assuming \(\delta\) isn’t super tiny – specifically, \(\log (1/\delta) \le O(\log|\mathcal{Y}|)\).

Now we can modify the algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) to also output an estimate of the loss using truncated Laplace noise. Call this new algorithm \(\overline{M} : \mathcal{X}^n \to \mathcal{Y} \times \mathbb{R}\). If \(M\) is \((\varepsilon,\delta)\)-DP, we can make \(\overline{M}\) satisfy \((\overline{\varepsilon}=2\varepsilon,\overline{\delta}=2\delta)\)-DP and guarantee that the error of the loss estimate is \(\le \tau = O(\log(1/\delta)/\varepsilon)\) with probability 1.[7]
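Here’s a sketch of this modification in Python, assuming a sensitivity-1 loss; the truncation threshold follows footnote 4, and all names (`M_bar`, `truncated_laplace`, etc.) are ours:

```python
import numpy as np

def truncated_laplace(scale, trunc, rng):
    # Sample Laplace(scale) conditioned on |noise| <= trunc via rejection
    # sampling. For a sensitivity-1 query this gives (eps, delta)-DP when
    # trunc = (1 + log(1/(2*delta))) / eps (footnote 4).
    while True:
        z = rng.laplace(0.0, scale)
        if abs(z) <= trunc:
            return z

def M_bar(x, M, loss, eps, delta, rng):
    # Release M(x) together with a private estimate of its own loss.
    # The estimate is off by at most tau with probability 1.
    tau = (1 + np.log(1 / (2 * delta))) / eps
    y = M(x)  # assumed (eps, delta)-DP
    z = loss(y, x) + truncated_laplace(1.0 / eps, tau, rng)
    return y, z  # (2*eps, 2*delta)-DP by basic composition
```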

The benefit of this modified DP algorithm \(\overline{M}\) is that it won’t fail silently. If the loss is high, we will know about it.

Conditioning on success

To recap, we have a \((\overline{\varepsilon}=2\varepsilon,\overline{\delta}=2\delta)\)-DP algorithm \(\overline{M} : \mathcal{X}^n \to \mathcal{Y} \times \mathbb{R}\) with the following properties. Let \(x \in \mathcal{X}^n\) be arbitrary. Then, for \( (Y,Z) \gets \overline{M}(x)\), we have \[\mathbb{P}[\ell(Y,x) \le \alpha]\ge 1-\beta ~~~~~\text{ and }~~~~~ \mathbb{P}[|Z-\ell(Y,x)|\le \tau]=1,\tag{3}\] where \(\tau=O(\log(1/\delta)/\varepsilon)\). It follows that \(\mathbb{P}[Z \le \alpha + \tau] \ge 1-\beta\) and that \(Z \le \alpha + \tau \implies \ell(Y,x) \le \alpha + 2\tau\) with probability 1.

Now we define our final algorithm \(\widetilde{M} : \mathcal{X}^n \to \mathcal{Y}\):

  1. Function \(\widetilde{M}(x)\):
  2.     Repeat as long as necessary:
  3.         Compute \((y,z) \gets \overline{M}(x)\).
  4.         If \(z \le \alpha + \tau\), return \(y\) and halt. Otherwise continue.

In other words, the output of \(\widetilde{M}\) is the output of \(\overline{M}\) conditioned on the reported loss being \(\le \alpha + \tau\). In symbols, \(\mathbb{P}[\widetilde{M}(x)=y]=\mathbb{P}[Y=y\mid Z \le \alpha + \tau]\) for \( (Y,Z) \gets \overline{M}(x)\). The number of times the loop in \(\widetilde{M}\) runs is geometrically distributed with mean \(\le\frac{1}{1-\beta}\). Note that \(\widetilde{M}\) needs to know[8] the utility threshold \(\alpha\) and, if for some reason this threshold is wrong, we could get an infinite loop![2]
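As a runnable sketch (continuing the hypothetical `M_bar` from above, with its extra arguments fixed via `functools.partial`):

```python
from functools import partial

def M_tilde(x, M_bar_fn, alpha, tau):
    # Resample until the privately reported loss clears the threshold.
    # The number of loop iterations is geometric with mean <= 1/(1 - beta);
    # if the threshold alpha is set too low, this loops forever (footnote 8).
    while True:
        y, z = M_bar_fn(x)
        if z <= alpha + tau:
            return y

# Hypothetical usage, reusing the earlier M_bar sketch:
# M_bar_fn = partial(M_bar, M=M, loss=loss, eps=eps, delta=delta, rng=rng)
# y = M_tilde(x, M_bar_fn, alpha, tau)
```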

By construction, we have \(\mathbb{P}[\ell(\widetilde{M}(x),x) \le \alpha + 2\tau ] = 1\). That is, we have zero failure probability. Now, what about privacy?

Theorem 1. Let \(\widetilde{M} : \mathcal{X}^n \to \mathcal{Y}\) be defined as above. Assume \(\overline{M} : \mathcal{X}^n \to \mathcal{Y} \times \mathbb{R}\) is \((\overline\varepsilon,\overline\delta)\)-differentially private and, for all inputs \(x\in\mathcal{X}^n\), if \((Y,Z)\gets\overline{M}(x)\), then \(\mathbb{P}[Z\le\alpha+\tau]\ge1-\beta>0\).[2] Then \(\widetilde{M}\) satisfies \((\widetilde{\varepsilon},\widetilde{\delta})\)-differential privacy for \[\widetilde{\varepsilon}=2\overline{\varepsilon} - \log(1-\overline{\delta}/(1-\beta)) ~~~~\text{ and }~~~~ \widetilde{\delta}=\frac{\overline{\delta}}{1-\beta} .\tag{4}\]

Proof. Let \(x,x'\in\mathcal{X}^n\) be neighbouring inputs. Let \((Y,Z) \gets \overline{M}(x)\) and \((Y',Z') \gets \overline{M}(x')\). The distribution of \(\widetilde{M}(x)\) is that of \(Y\) conditioned on \(Z \le \alpha + \tau\). Similarly, the distribution of \(\widetilde{M}(x')\) is that of \(Y'\) conditioned on \(Z' \le \alpha + \tau\). Let \(S \subset \mathcal{Y}\) be arbitrary but measurable. It suffices to show that \[\mathbb{P}[Y \in S \mid Z \le \alpha + \tau] \le e^{\widetilde{\varepsilon}} \mathbb{P}[Y' \in S \mid Z' \le \alpha + \tau] + \widetilde{\delta}.\] We have \[\begin{aligned} \mathbb{P}[Y \in S \mid Z \le \alpha + \tau] &= \frac{\mathbb{P}[Y \in S ~\&~ Z \le \alpha + \tau]}{\mathbb{P}[Z \le \alpha + \tau]}\\ &\le \frac{e^{\overline{\varepsilon}}\mathbb{P}[Y' \in S ~\&~ Z' \le \alpha + \tau] + \overline{\delta}}{\mathbb{P}[Z \le \alpha + \tau]}\\ &\le \frac{e^{\overline{\varepsilon}}\mathbb{P}[Y' \in S ~\&~ Z' \le \alpha + \tau]}{e^{-\overline{\varepsilon}}(\mathbb{P}[Z' \le \alpha + \tau]-\overline{\delta})} + \frac{\overline{\delta}}{\mathbb{P}[Z \le \alpha + \tau]}\\ &= \frac{e^{2\overline{\varepsilon}}\mathbb{P}[Y' \in S ~\&~ Z' \le \alpha + \tau]}{\mathbb{P}[Z' \le \alpha + \tau]}\cdot\frac{\mathbb{P}[Z' \le \alpha + \tau]}{\mathbb{P}[Z' \le \alpha + \tau]-\overline{\delta}} + \frac{\overline{\delta}}{\mathbb{P}[Z \le \alpha + \tau]}\\ &= e^{2\overline{\varepsilon}}\mathbb{P}[Y' \in S \mid Z' \le \alpha + \tau] \cdot \frac{1}{1-\overline{\delta}/\mathbb{P}[Z' \le \alpha + \tau]} + \frac{\overline{\delta}}{\mathbb{P}[Z \le \alpha + \tau]}\\ &\le \frac{e^{2\overline{\varepsilon}}}{1-\overline{\delta}/(1-\beta)}\mathbb{P}[Y' \in S \mid Z' \le \alpha + \tau] + \frac{\overline{\delta}}{1-\beta}\\ &= e^{\widetilde{\varepsilon}} \mathbb{P}[Y' \in S \mid Z' \le \alpha + \tau] + \widetilde{\delta}, \end{aligned}\] where the second inequality applies the DP guarantee to the numerator, the third lower-bounds the denominator by applying the DP guarantee to the event \(\{Z' \le \alpha + \tau\}\), and the final inequality uses \(\mathbb{P}[Z \le \alpha + \tau] \ge 1-\beta\) and \(\mathbb{P}[Z' \le \alpha + \tau] \ge 1-\beta\). ∎

As long as \(1-\beta \ge \Omega(1)\), Theorem 1 gives \(\widetilde{\varepsilon} = 2\overline{\varepsilon} + O(\overline{\delta}) = 4\varepsilon + O(\delta)\) and \(\widetilde{\delta}=O(\overline{\delta})=O(\delta)\). Putting the pieces together, we have the following result.

Theorem 2. Let \(\ell : \mathcal{Y} \times \mathcal{X}^n \to \mathbb{R}\) have sensitivity 1 in its second argument and let \(\alpha,\beta,\varepsilon,\delta\in\mathbb{R}\) with \(0\le\beta<1-2\delta<1\). Let \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfy \((\varepsilon,\delta)\)-DP and assume \(\mathbb{P}[\ell(M(x),x) \le \alpha ] \ge 1-\beta \) for all \(x \in \mathcal{X}^n\).
Then there exists \(\widetilde{M} : \mathcal{X}^n \to \mathcal{Y}\) that is \((4\varepsilon-\log\left(1-\frac{2\delta}{1-\beta}\right),\frac{2\delta}{1-\beta})\)-DP and \(\mathbb{P}[\ell(\widetilde{M}(x),x) \le \alpha + 2\tau ] = 1\) for all \(x\in\mathcal{X}^n\), where \(\tau=O(\log(1/\delta)/\varepsilon)\) is the truncation threshold for \((\varepsilon,\delta)\)-DP truncated Laplace noise.[4]

Note that our utility failure probability \(\beta\) appears in the privacy parameters of Theorem 2.[2] This is a bit unintuitive, but we saw how it can happen earlier with the trick of absorbing the utility failure as a privacy failure. The dependence here is milder than before: e.g., we can start with a high utility failure probability \(\beta=0.5\) and still get a low final privacy failure probability \(\widetilde{\delta}\le10^{-6}\).

Overall we pay a constant factor in the privacy parameters and suffer an additive increase in the loss in order to eliminate the failure probability. And (unlike the earlier trick) this is true even if the initial failure probability was quite large.

Conclusion

We’ve presented two methods for eliminating the failure probability from DP algorithms. The first method simply moves the failure from utility to privacy; this has obvious downsides. The second method avoids these downsides and is applicable even when the initial failure probability is large, but it blows up the privacy parameters by a multiplicative factor and requires some wiggle room in the loss. The second method is based on a result by Gupta, Ligett, McSherry, Roth, & Talwar [GLMRT10 Theorem 10.2].

In both cases, we crucially exploit the nonzero \(\delta\) in approximate \((\varepsilon,\delta)\)-DP. And one of the high-level take-home messages of this post is simply that \(\delta\) can absorb utility failures, in addition to privacy failures.

For simplicity, this post has focused on fully eliminating the failure probability. What if, instead, we just want to reduce it? Is \(\delta\) still crucial? No! The second method we presented works even with \(\widetilde{\delta}=0\) or with Rényi DP; but we cannot entirely eliminate the failure probability. The math gets messier, but the high-level idea is pretty simple: Instead of using truncated Laplace noise, we use regular Laplace noise (to avoid nonzero \(\delta\)). This means there’s a chance that \(\overline{M}\) falsely reports low loss, which means there’s a chance of failure. But, as long as the chance of falsely reporting a low loss is much smaller than the chance of correctly reporting a low loss, the overall failure probability is low.
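In the earlier sketch, this variant is a one-line change: swap the truncated sampler for ordinary Laplace noise (again, all names are hypothetical):

```python
def M_bar_pure(x, M, loss, eps, rng):
    # Same as M_bar, but with untruncated Laplace noise, so delta = 0.
    # Now z can underestimate the true loss by more than tau, so the
    # retry loop can accept a bad y: a small, but nonzero, failure probability.
    y = M(x)
    z = loss(y, x) + rng.laplace(0.0, 1.0 / eps)
    return y, z
```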

If you want to learn more about extensions of the second method, read the papers of Liu & Talwar [LT19], Papernot & Steinke [PS22], and Cohen, Lyu, Nelson, Sarlós, & Stemmer [CLNSS23]. These methods are particularly useful in settings where the initial success probability is low, e.g. \(1-\beta=0.01\), such as when there is some element of random guessing involved.

The other key take-home message of this post is that the failure probability shouldn’t be a first-order concern, at least from a theoretical perspective. In particular, if we obtain bounds on the expected error, then we can obtain high-probability bounds via this method.[9]

In many cases the reductions we presented are not practical; it’s usually easier to directly modify the algorithm to reduce the failure probability. However, the fact that these generic methods exist offers an explanation for why, in practice, failure probabilities are relatively easy to manage.


  1. Note that we’re implicitly assuming that the loss \(\ell\) is known – i.e., it is something we can compute when designing algorithms. In particular, the loss must be an empirical loss, rather than a population loss. 

  2. It’s important that we have a provable worst-case failure probability bound for the original algorithm \(M\), since we want a provable privacy guarantee. In particular, if we only have a heuristic that works for most inputs \(x\), but fails badly on other inputs, then we cannot get a provable DP guarantee using these methods. It is possible that heuristics can be modified to fail gracefully and thus these methods can be salvaged, but that’s beyond the scope of this post. 

  3. For simplicity, in this post we will (mostly) talk about failure as a boolean event; i.e., there is a hard utility threshold at \(\alpha\). Of course, in most cases, there is not a hard threshold and it makes sense to talk about the tail probability \(\beta\) as a function of the threshold \(\alpha\), rather than a single value. Note that we look at the worst case over inputs \(x\); that is, we aren’t in a statistical setting where inputs are random and we aren’t considering (non-private) statistical errors. 

  4. To achieve \((\varepsilon,\delta)\)-DP with \(0 < \delta \le \frac{1}{2}\), we can use Laplace noise truncated to magnitude \(\tau = \frac{1+\log(1/(2\delta))}{\varepsilon} = O(\log(1/\delta)/\varepsilon)\). Truncated Laplace noise is folklore [L16]; Holohan et al. [HABA18] give a sharp analysis. 

  5. Note that we can always just define \(\check{M}(x) = \mathsf{argmin}_y \ell(y,x)\) or we can re-run \(M\) until we achieve the desired loss. 

  6. Recall that the privacy failure probability should be tiny – e.g., \(\delta \le 10^{-6}\) – for the privacy guarantee to be compelling. 

  7. For simplicity, we’re setting the privacy parameters of the truncated Laplace to be the same as for \(M\). In practice, this might be excessive and a different balance would work better. Also, some algorithms naturally output an estimate of their error and so this modification may not be necessary. 

  8. If the utility threshold \(\alpha\) isn’t known (e.g., if it depends on the input \(x\)), then there are other methods that can be used [LT19,PS22,CLNSS23], but this is beyond the scope of this blog post. 

  9. To be precise, if we have \(\mathbb{E}[\ell(M(x),x)]\le \alpha_*\) and \(\ell(y,x)\ge0\), then Markov’s inequality gives \(\mathbb{P}[\ell(M(x),x)\le\alpha]\ge1-\frac{\alpha_*}{\alpha}\) for all \(\alpha\ge\alpha_*\). We can plug this bound into Theorem 2 to get a high-probability bound. 

Posted by Xin Lyu and Thomas Steinke on March 9, 2025.
Categories: Algorithms
