Limits of Privacy Amplification by Subsampling

In our previous post we gave a brief introduction to privacy amplification by subsampling. The high-level story is that we can make differentially private algorithms faster by runninng them on a subsample of the dataset instead of the whole dataset and this comes at essentially no cost in privacy and accuracy. That story is pretty good. But now we’ll take a closer look at the details of this story.

Privacy Amplification by Subsampling

Privacy Amplification by Subsampling is an important property of differential privacy. It is key to making many algorithms efficient – particularly in machine learning applications. Thus a lot of work has gone into analyzing this phenomenon. In this post we will give a quick introduction to privacy amplification by subsampling and its applications. In a follow-up post, we’re going to look at the limitations of privacy amplification by subsampling – i.e., when it doesn’t quite live up to the promises.

Differentially Private Algorithms that Never Fail

Most differentially private algorithms fail with some nonzero probability. For example, when adding Gaussian or Laplace noise, there is some chance that the noise deviates significantly from its mean. But, fortunately, large deviations are unlikely. In this post we’re going to take a closer look at failure modes of DP algorithms and we’ll present some generic methods for reducing – or even eliminating – the failure probability.

Tight RDP & zCDP Bounds from Pure DP

There are multiple ways to quantify differential privacy, including pure DP [DMNS06], approximate DP [DKMMN06], Concentrated DP [DR16,BS16], Rényi DP [M17], Gaussian DP [DRS19], & function-DP [DRS19]. Fortunately, these definitions are similar enough that we can convert between most of them (with some loss in parameters).

NeurIPS 2023 Outstanding Paper: Privacy auditing in just one run

NeurIPS 2023 just wrapped up, and one of the two outstanding paper awards went to Privacy Auditing with One (1) Training Run, by Thomas Steinke, Milad Nasr, and Matthew Jagielski. The main result of this paper is a method for auditing the (differential) privacy guarantees of an algorithm, but much faster and more practically than previous methods. In this post, we’ll dive into what this all means.

Open problem(s) - How generic can composition results be?

The composition theorem is a cornerstone of differential privacy literature. In its most basic formulation, it states that if two mechanisms $\mathcal{M}_1$ and $\mathcal{M}_2$ are respectively $\varepsilon_1$-DP and $\varepsilon_2$-DP, then the mechanism $\mathcal{M}$ defined by $\mathcal{M}(D)=\left(\mathcal{M}_1(D),\mathcal{M}_2(D)\right)$ is $(\varepsilon_1+\varepsilon_2)$-DP. A large body of work focused on proving extensions of this composition theorem. These extensions are of two kinds.

Beyond Local Sensitivity via Down Sensitivity

In our previous post, we discussed local sensitivity and how we can get accuracy guarantees that scale with local sensitivity, which can be much better than the global sensitivity guarantees attained via standard noise addition mechanisms. In this post, we will look at what we can do when even the local sensitivity is unbounded. This is obviously a challenging setting, but it turns out that not all hope is lost.

Beyond Global Sensitivity via Inverse Sensitivity

The most well-known and widely-used method for achieving differential privacy is to compute the true function value $f(x)$ and then add Laplace or Gaussian noise scaled to the global sensitivity of $f$. This may be overly conservative. In this post we’ll show how we can do better.

Covariance-Aware Private Mean Estimation, Efficiently

Last week, the Mark Fulk award for best student paper at COLT 2023 was awarded to the following two papers on private mean estimation:

Call for Papers - TPDP 2023 - Submission deadline July 7

The 9th Workshop on the Theory and Practice of Differential Privacy (TPDP 2023) will take place in Boston September 27-28, 2023. This is the first year the workshop is a standalone event. However, the OpenDP community meeting is the following day (also in Boston). It is also moving from a one-day event to two days.

Open problem - Better privacy guarantees for larger groups

Consider a simple query counting the number of people in various mutually exclusive groups. In the differential privacy literature, it is typical to assume that each of these groups should be subject to the same privacy loss: the noise added to each count has the same magnitude, and everyone gets the same privacy guarantees. However, in settings where these groups have vastly different population sizes, larger populations may be willing to accept more error in exchange for stronger privacy protections. In particular, in many use cases, relative error (the noisy count is within 5% of the true value) matters more than absolute error (the noisy count is at a distance of at most 100 of the true value). This leads to a natural question: can we use this fact to develop a mechanism that improves the privacy guarantees of individuals in larger groups, subject to a constraint on relative error?

Composition Basics

Our data is subject to many different uses. Many entities will have access to our data and those entities will perform many different analyses that involve our data. The greatest risk to privacy is that an attacker will combine multiple pieces of information from the same or different sources and that the combination of these will reveal sensitive details about us. Thus we cannot study privacy leakage in a vacuum; it is important that we can reason about the accumulated privacy leakage over multiple independent analyses, which is known as composition. We have previously discussed why composition is so important for differential privacy.

This is the first in a series of posts on composition in which we will explain in more detail how compositoin analyses work.

Composition is quantitative. The differential privacy guarantee of the overall system will depend on the number of analyses and the privacy parameters that they each satisfy. The exact relationship between these quantities can be complex. There are various composition theorems that give bounds on the overall parameters in terms of the parameters of the parts of the system.

The simplest composition theorem is what is known as basic composition, which applies to pure $\varepsilon$-DP (although it can be extended to approximate $(\varepsilon,\delta)$-DP):

Theorem (Basic Composition) Let $M_1, M_2, \cdots, M_k : \mathcal{X}^n \to \mathcal{Y}$ be randomized algorithms. Suppose $M_j$ is $\varepsilon_j$-DP for each $j \in [k]$. Define $M : \mathcal{X}^n \to \mathcal{Y}^k$ by $M(x)=(M_1(x),M_2(x),\cdots,M_k(x))$, where each algorithm is run independently. Then $M$ is $\varepsilon$-DP for $\varepsilon = \sum_{j=1}^k \varepsilon_j$.

Privacy Doona: Why We Should Hide Among The Clones

In this blog post, we will discuss a recent(ish) result of Feldman, McMillan, and Talwar [FMT21], which provides an improved and simple analysis of the so-called “amplification by shuffling” formally connecting local privacy (LDP) and shuffle privacy.¹ Now, I’ll assume the reader is familiar with both LDP and Shuffle DP: if not, a quick-and-dirty refresher (with less quick, and less dirty references) can be found here, and of course there is also Albert Cheu’s excellent survey on Shuffle DP [Cheu21].

The title of this post is a reference to the title of [FMT21], “Hiding Among The Clones,” and to the notion of privacy blanket introduced by Balle, Bell, Gascón, and Nissim [BBGN19]. Intuitively, the “amplification by shuffling” paradigm can be seen as anonymizing the messages from local randomizers, whose message distribution can be mathematically decomposed as a mixture of “noise distribution not depending on the user’s input” and “distribution actually depending on their input.” As a result, each user randomly sends a message from the first or second distribution of the mixture. But the shuffling then hides the informative messages (drawn from the second part of the mixture) among the non-informative (noise) ones: so the noise messages end up providing a “privacy blanket” in which sensitive information is safely and soundly wrapped. ↩

Differentially private deep learning can be effective with self-supervised models

Differential Privacy (DP) is a formal definition of privacy which guarantees that the outcome of a statistical procedure does not vary much regardless of whether an individual input is included or removed from the training dataset. This guarantee is desirable when we are tasked to train machine learning models on private datasets that should not memorize individual inputs. Past works have shown that differentially private models can be resilient to strong membership inference [1, 34, 35] and data reconstruction attacks [2, 3] when the privacy parameter is set to be sufficiently small. See a prior post for more background on differentially private machine learning.

A simple recipe for private synthetic data generation

In the last blog post, we covered the potential pitfalls of synthetic data without formal privacy guarantees, and motivated the need for differentially private synthetic data mechanisms. In this blog post, we will describe the select-measure-generate paradigm, which is a simple and effective template for designing synthetic data mechanisms. The three steps underlying the select-measure-generate paradigm are illustrated and explained below.

What is Synthetic Data?

The concept of synthetic data seems to be having “a moment” in the privacy world as a promising approach to sharing data while protecting privacy. Strictly speaking, any time you make up data, you have produced a synthetic dataset, but more specifically

Conference Digest - NeurIPS 2021

The accepted papers for NeurIPS 2021 were recently announced, and there’s a huge amount of differential privacy content. We found one relevant workshop and 48 papers. This is up from 31 papers last year, an over 50% increase! It looks like there’s huge growth in interest on differentially private machine learning. Impressively, at the time of this writing, all but five papers are already posted on arXiv! For the full list of accepted papers, see here. Please let us know if we missed relevant papers on differential privacy!

How to deploy machine learning with differential privacy?

In many applications of machine learning, such as machine learning for medical diagnosis, we would like to have machine learning algorithms that do not memorize sensitive information about the training set, such as the specific medical histories of individual patients. Differential privacy is a notion that allows quantifying the degree of privacy protection provided by an algorithm on the underlying (sensitive) data set it operates on. Through the lens of differential privacy, we can design machine learning algorithms that responsibly train models on private data.

One-shot DP Top-k mechanisms

In the last blog post, we showed that the exponential mechanism enjoys improved composition bounds over general pure DP mechanisms due to a property called bounded range. For this post, we will present another useful, and somewhat surprising, property of the exponential mechanism in its application of top-$k$ selection.

A Better Privacy Analysis of the Exponential Mechanism

A basic and frequent task in data analysis is selection – given a set of options $\mathcal{Y}$, output the (approximately) best one, where “best” is defined by some loss function $\ell : \mathcal{Y} \times \mathcal{X}^n \to \mathbb{R}$ and a dataset $x \in \mathcal{X}^n$. That is, we want to output some $y \in \mathcal{Y}$ that approximately minimizes $\ell(y,x)$. Naturally, we are interested in private selection – i.e., the output should be differentially private in terms of the dataset $x$. This post discusses algorithms for private selection – in particular, we give an improved privacy analysis of the popular exponential mechanism.

Open Problem - Optimal Query Release for Pure Differential Privacy

Releasing large sets of statistical queries is a centerpiece of the theory of differential privacy. Here, we are given a dataset $x = (x_1,\dots,x_n) \in [T]^n$, and a set of statistical queries $f_1,\dots,f_k$, where each query is defined by some bounded function $f_j : [T] \to [-1,1]$, and (abusing notation) is defined as \[ f_j(x) = \frac{1}{n} \sum_{i=1}^{n} f_j(x_i). \] We use $f(x) = (f_1(x),\dots,f_k(x))$ to denote the vector consisting of the true answers to all these queries. Our goal is to design an $(\varepsilon, \delta)$-differentially private algorithm $M$ that takes a dataset $x\in [T]^n$ and outputs a random vector $M(x)\in \mathbb{R}^k$ such that $\| M(x) - f(x) \|$ is small in expectation for some norm $\|\cdot\|$. Usually algorithms for this problem also give high probability bounds on the error, but we focus on expected error for simplicity.

Conference Digest - ICML 2021

ICML 2021, one of the biggest conferences in machine learning, naturally has a ton of interesting sounding papers on the topic of differential privacy. We went through this year’s accepted papers and aggregated all the relevant papers we could find. In addition, this year features three workshops on the topic of privacy, as well as a tutorial. As always, please inform us if we overlooked any papers on differential privacy.

Statistical Inference is Not a Privacy Violation

On April 28, 2021, the US Census Bureau released a new demonstration of its differentially private Disclosure Avoidance System (DAS) for the 2020 US Census. The public were given a month to submit feedback before the system is finalized. This demonstration data and the feedback has generated a lot of discussion, including media coverage on National Public Radio, in the Washington Post, and via the Associated Press. The DAS is also the subject of an ongoing lawsuit.

Call for Papers - Workshop on the Theory and Practice of Differential Privacy (TPDP 2021)

Work on differential privacy spans a number of different research communities, including theoretical computer science, machine learning, statistics, security, law, databases, cryptography, programming languages, social sciences, and more. Each of these communities may choose to publish their work in their own community’s venues, which could result in small groups of differential privacy researchers becoming isolated. To alleviate these issues, we have the Workshop on the Theory and Practice of Differential Privacy (TPDP), which is intended to bring these subcommunities together under one roof (well, a virtual one at least for 2020 and 2021).

ALT Highlights - An Equivalence between Private Learning and Online Learning (ALT '21 Tutorial)

Welcome to ALT Highlights, a series of blog posts spotlighting various happenings at the recent conference ALT 2021, including plenary talks, tutorials, trends in learning theory, and more! To reach a broad audience, the series will be disseminated as guest posts on different blogs in machine learning and theoretical computer science. Given the topic of this post, we felt DifferentialPrivacy.org was a great fit. This initiative is organized by the Learning Theory Alliance, and overseen by Gautam Kamath. All posts in ALT Highlights are indexed on the official Learning Theory Alliance blog.

What is δ, and what δifference does it make?

There are many variants or flavours of differential privacy (DP) some weaker than others: often, a given variant comes with own guarantees and “conversion theorems” to the others. As an example, “pure” DP has a single parameter $\varepsilon$, and corresponds to a very stringent notion of DP:

Conference Digest - TPDP 2020

TPDP 2020 is a workshop focused on differential privacy. As such, it’s a great place to learn about recent developments in the DP research community. It will be held on 13 November and is co-located with CCS, but, of course, it’s virtual this year. Registration is only US$35 if you register by Friday, 30 October. Check out the 8 excellent talks and 71 posters below – wow, the workshop has grown!

Reconstruction Attacks in Practice

This is the second of two posts describing the theory and practice of reconstruction attacks. To read the first post, which covers the theoretical basis of such attacks, [click here].

The Theory of Reconstruction Attacks

We often see people asking whether or not differential privacy might be overkill. Why do we need strong privacy protections like differential privacy when we’re only releasing approximate, aggregate statistical information about a dataset? Is it really possible to extract information about specific users from releasing these statistics? The answer turns out to be a resounding yes! The textbook by Dwork and Roth [DR14] calls this phenomenon the Fundamental Law of Information Recovery:

Giving overly accurate answers to too many questions will inevitably destroy privacy.

Conference Digest - NeurIPS 2020

NeurIPS 2020 is the biggest conference on machine learning, with tons of content on differential privacy in many different forms. We were able to find two workshops, a competition, and 31 papers. This was just going off the preliminary accepted papers list, so it’s possible that we might have missed some papers on differential privacy – please let us know! We will update this post later, once all the conference material (papers and videos) are publicly available.

Open Problem - Avoiding the Union Bound for Multiple Queries

Background: Perhaps the best-studied problem in differential privacy is answering multiple counting queries. The standard approach is to add independent, appropriately-calibrated (Laplace or Gaussian) noise to each query result and apply a composition theorem. To bound the maximum error over the query answers, one takes a union bound over the independent noise samples. However, this is not optimal. The problem is to identify the optimal method (up to constant factors).

Differentially Private PAC Learning

The study of differentially private PAC learning runs all the way from its introduction in 2008 [KLNRS08] to a best paper award at the Symposium on Foundations of Computer Science (FOCS) this year [BLM20]. In this post, we’ll recap the history of this line of work, aiming for enough detail for a rough understanding of the results and methods.

Conference Digest - ICML 2020

ICML 2020 is one of the premiere venues in machine learning, and generally features a lot of great work in differentially private machine learning. This year is no exception: the relevant papers are listed below to the best of our ability, including links to the full versions of papers, as well as the conference pages (which contain slides and 15 minute videos for each paper). As always, please inform us if we overlooked any papers on differential privacy.

Conference Digest - COLT 2020

COLT 2020 was held online in July, and featured nine papers on differential privacy, as well as a keynote talk by Salil Vadhan. While differential privacy has always had a home in the COLT community, it seems like this year was truly exceptional in terms of the number of results. We link all the content below, including pointers to the papers, videos on Youtube, and the page on the conference website. Please let us know if we missed any papers on differential privacy, either in the comments below or by email.

Why Privacy Needs Composition

We’re back! In our last post we discussed some of the subtle pitfalls of formulating the assumptions underlying average-case relaxations of differential privacy. This time we’re going to look at the composition property of differential privacy—that is, the fact that running two independent differentially private algorithms on your data and combining their outputs is still differentially private. This is a key property of differential privacy and is actually closely related to the worst-case nature of differential privacy.

Open Problem - Private All-Pairs Distances

Background: Suppose we are interested in computing the distance between two vertices in a graph. Under edge or node differential privacy, this problem is not promising because the removal of a single edge can make distances change from 1 to $n − 1$ or can even disconnect the graph. However, a different setting that makes sense to consider is that of a weighted graph $(G, w)$ whose topology $G = (V, E)$ is publicly known but edge weight function $w : E \to \mathbb{R}^+$ must be kept private. (For instance, consider transit times on a road network. The topology of the road network may be publicly available as a map, but the edge weights corresponding to transit times may be based on private GPS locations of individual cars.)

The Pitfalls of Average-Case Differential Privacy

Differential privacy protects against extremely strong adversaries—even ones who know the entire dataset except for one bit of information about one individual. Since its inception, people have considered ways to relax the definition to assume a more realistic adversary. A natural way to do so is to incorporate some distributional assumptions. That is, rather than considering a worst-case dataset, assume the dataset is drawn from some distribution and provide some form of “average-case” or “Bayesian” privacy guarantee with respect to this distribution. This is especially tempting as it is common for statistical analysis to work under distributional assumptions.

Conference Digest - STOC 2020

STOC 2020 was recently held online, as one of the first major theory conferences during the COVID-19 era. It featured four papers on differential privacy, which we list and link below. Each one is accompanied by a video from the conference, as well as a longer video if available. Please let us know if we missed any papers on differential privacy, either in the comments below or by email.

Trust Models, and Notions of Privacy

There exist various notions of differential privacy which, while sharing a common core, differ in some key specific aspects. Broadly speaking, vary among a few main axes, such as the type of guarantee they provide, the specific similarity between data they consider, and the trust model they aim to address. This last point will be the focus of this post: which notion of privacy is best suited to the specific scenario at hand?

Welcome to DifferentialPrivacy.org!

Hello, welcome to this new website! Our goal is to serve as a hub for the differential privacy research community and to promote the work in this area. Please read on to learn more!