Some Thoughts on Differential Privacy
Differential privacy is a popular framework for protecting privacy when collecting and analyzing data. In simple terms, it involves adding noise to a dataset to protect individual privacy while still allowing valuable insights to be extracted. The level of noise dictates the tradeoff between privacy and accuracy.
The framework is defined for randomized algorithms in order to account for mechanisms such as noise addition, and the definition is usually presented in terms of the probabilities of outputs on similar datasets. Simply put, for two datasets that differ in just one value, the relative error between the probabilities of their outputs landing in the same subset of the range should be less than a threshold ε, also referred to as the privacy budget.
Definition: Let D and D′ be two datasets that differ in just one value. A randomized algorithm F is ε-differentially private if, for every subset R of the range of F, we have

| P[F(D) ∈ R] − P[F(D′) ∈ R] | / P[F(D′) ∈ R] ≤ ε,

where the probabilities are taken over the randomness of F and ε ≥ 0 is the privacy budget.
This is more commonly stated with exp instead, as P[F(D) ∈ R] ≤ exp(ε) · P[F(D′) ∈ R], with D and D′ on separate sides of the inequality. But the version above really highlights the word ‘differential’ in differential privacy.
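To make the definition concrete, here is a small sketch that estimates both probabilities by simulation for a simple count query with Laplace noise. The query, the dataset and the set R are my own illustrative choices, and the estimated relative error is compared against exp(ε) − 1, the bound implied by the more common exp form of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_count(data, epsilon, size):
    # Laplace mechanism for a count query (sensitivity 1): count + Lap(1/eps).
    return len(data) + rng.laplace(scale=1.0 / epsilon, size=size)

# Two neighbouring datasets: D_prime drops one record from D.
D = list(range(100))
D_prime = D[:-1]

epsilon, trials = 0.5, 500_000
R = (99.5, 100.5)  # an arbitrary subset R of the output range

def prob_in_R(data):
    out = noisy_count(data, epsilon, trials)
    return np.mean((out > R[0]) & (out < R[1]))

p, p_prime = prob_in_R(D), prob_in_R(D_prime)
print(f"P[F(D) in R]   ~ {p:.4f}")
print(f"P[F(D') in R]  ~ {p_prime:.4f}")
print(f"relative error ~ {abs(p - p_prime) / p_prime:.4f}, "
      f"bound exp(eps) - 1 = {np.exp(epsilon) - 1:.4f}")
```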
One of the most common choices for creating differentially private algorithms is the Laplace mechanism. It works by adding noise sampled from the Laplace distribution to individual data values.
For example, in a dataset of individual salaries xᵢ bounded above by some value C (the maximum salary in the set), adding Laplace noise to each xᵢ via this mechanism and then releasing the average ensures the result is differentially private. What this means is that even if someone gains access to the released value, an individual can deny that their data was used in the computation: the probability of the output taking the stated value can be made arbitrarily close to the probability without their data. This property is commonly known as ‘plausible deniability’.
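As a minimal sketch with illustrative numbers, assuming salaries are clipped to [0, C] and that each record receives Laplace noise of scale C/ε (one standard choice for this kind of per-record perturbation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_average(salaries, C, epsilon):
    # Clip salaries to the assumed bound [0, C], add Laplace noise of scale
    # C / eps to every record, and only then take the average. The (1/n)
    # averaging shrinks the added noise as the dataset grows.
    x = np.clip(np.asarray(salaries, dtype=float), 0.0, C)
    noisy = x + rng.laplace(scale=C / epsilon, size=len(x))
    return float(noisy.mean())

# Hypothetical salary data, purely illustrative.
salaries = rng.uniform(30_000, 150_000, size=10_000)
C, epsilon = 150_000, 1.0

print("true average:", round(float(salaries.mean()), 2))
print("DP average  :", round(dp_average(salaries, C, epsilon), 2))
```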
Ideally, the noise is added to the data before it is stored, so that the guarantee above holds for anything derived from the stored values. It is also possible to add noise on the fly during retrieval, which has its own pros and cons and is discussed further below.
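Roughly, the two placements of noise look like this; the function names and noise scales are illustrative assumptions rather than a prescribed implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_at_ingestion(value, C, epsilon):
    # Noise is added once, before storage; the raw value is never kept.
    return float(np.clip(value, 0.0, C) + rng.laplace(scale=C / epsilon))

def average_at_query_time(stored_values, C, epsilon):
    # Raw values are stored exactly; noise is added on-the-fly per query,
    # here on the aggregate with scale C / (n * eps).
    x = np.clip(np.asarray(stored_values, dtype=float), 0.0, C)
    return float(x.mean() + rng.laplace(scale=C / (len(x) * epsilon)))
```

In the first pattern the stored data is already private; in the second, the data stays exact and every retrieval path becomes responsible for adding noise.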
One thing to note is that even if the Laplace noise is clipped to a specified range [a, b], the algorithm can still be shown to be differentially private [1], albeit under the relaxed notion of (ε, δ)-differential privacy. This still ensures plausible deniability, though with a different rate of convergence. Since the guarantee survives clipping, the noise term can be made to go to zero for very large datasets thanks to the (1/n) term, which is also a motivating factor for large companies to embrace differential privacy.
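A sketch of the clipped variant, with the salary bound, noise scale and clipping range all chosen as placeholder values:

```python
import numpy as np

rng = np.random.default_rng(0)

def clipped_laplace_average(salaries, C, epsilon, a, b):
    # Per-record Laplace noise clipped to [a, b] before averaging. As noted
    # above, this yields the relaxed (eps, delta) guarantee rather than pure
    # eps-DP [1]; the (1/n) averaging still drives the noise term to zero
    # for very large datasets.
    x = np.clip(np.asarray(salaries, dtype=float), 0.0, C)
    noise = np.clip(rng.laplace(scale=C / epsilon, size=len(x)), a, b)
    return float((x + noise).mean())

# Illustrative placeholder values for the bound and clipping range.
salaries = rng.uniform(30_000, 150_000, size=10_000)
print(round(clipped_laplace_average(salaries, C=150_000, epsilon=1.0,
                                    a=-500_000, b=500_000), 2))
```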
Most implementations, however, use the simple Laplace mechanism, and this carries what is commonly known as “tail risk”, a recurring theme in machine learning as well. For example, a classifier failing to detect certain out-of-distribution data can snowball into a self-driving car crash. Similarly, in differential privacy, tail risk is the situation where the unbounded Laplace noise, irrespective of the scaling used, occasionally produces an extreme draw, say one that wipes out someone’s bank account balance. Production systems should therefore either use a clipped mechanism or clearly state the risks when the simple version is used.
Differential privacy also presents a paradigm shift in terms of data storage: data is “never consistent”, a departure from the usual notions of strongly and eventually consistent data. A company using differential privacy must clearly state in its policies that the data can never be completely consistent. This is indeed the case with Apple, for example, which publishes an elaborate document presenting details even on the privacy budget, in addition to being transparent about the compromise on accuracy.
A lot of this can be circumvented by preserving the exact data and adding noise on the fly at computation time. This is trivial to achieve on the user side, where each device perturbs its own data before sharing it, and is commonly known as “local differential privacy”. On the server side, however, storing exact data opens up the risk of it being intercepted, though that risk can be mitigated by operating under more fortified networks. On-the-fly noise addition also allows variable noise levels per query, which can support many other operations beyond the average under differential privacy.
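A rough sketch of such an on-the-fly setup, where exact data is stored and each query gets its own ε; the class, query names and sensitivity choices are assumptions for illustration, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class OnTheFlyDP:
    # Exact data is stored; Laplace noise is added per query with a
    # per-query epsilon (the "variable noise levels" mentioned above).
    # Sensitivities assume values lie in [0, C] and that neighbouring
    # datasets differ by one added or removed record.
    def __init__(self, values, C):
        self.x = np.clip(np.asarray(values, dtype=float), 0.0, C)
        self.C = C

    def _lap(self, sensitivity, epsilon):
        return float(rng.laplace(scale=sensitivity / epsilon))

    def count(self, epsilon):
        return len(self.x) + self._lap(1.0, epsilon)

    def sum(self, epsilon):
        return float(self.x.sum()) + self._lap(self.C, epsilon)

    def mean(self, epsilon):
        return float(self.x.mean()) + self._lap(self.C / len(self.x), epsilon)

store = OnTheFlyDP(rng.uniform(30_000, 150_000, size=10_000), C=150_000)
print(round(store.mean(epsilon=0.5), 2), round(store.sum(epsilon=0.5), 2))
```

In practice, repeated queries consume privacy budget, so the per-query ε values would need to be tracked and accounted for.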
Differential privacy is also seen in more complex settings, such as the training process in machine learning. For example, training on data with Laplace noise added to each batch, or adding noise to the clipped per-batch gradients as in [2], can make the overall training algorithm differentially private. The compromise on accuracy, however, has to be ascertained empirically for specific datasets and networks, as it might not be straightforward to quantify the uncertainty in this case.
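As a rough illustration (not the exact recipe of [2], which clips per-example gradients and adds calibrated Gaussian noise), here is one noisy gradient step for a simple linear model; the clipping norm and noise scale below are placeholders, not a calibrated privacy analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_scale=1.0):
    # One step of gradient descent for linear regression with squared loss:
    # per-example gradients are clipped to bound their influence, averaged,
    # and perturbed with Laplace noise before the update.
    grads = []
    for xi, yi in zip(X, y):
        g = 2.0 * (xi @ w - yi) * xi
        norm = np.linalg.norm(g)
        grads.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    g_mean = np.mean(grads, axis=0)
    noise = rng.laplace(scale=noise_scale * clip_norm / len(X), size=w.shape)
    return w - lr * (g_mean + noise)

# Tiny synthetic regression problem.
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=64)
w = np.zeros(3)
for _ in range(300):
    w = noisy_sgd_step(w, X, y)
print("learned weights:", np.round(w, 2))
```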
Queries over non-numerical data can also be made differentially private by using distance-preserving hash functions to map the data to numerical values and then applying similar mechanisms.
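As a simplified stand-in for a distance-preserving hash, the sketch below just hashes categorical values into buckets and releases a noisy histogram; adding or removing one record changes a single bucket count by one, so Laplace noise of scale 1/ε per bucket is a natural choice here:

```python
import hashlib
import numpy as np

rng = np.random.default_rng(0)

def bucket(value: str, num_buckets: int) -> int:
    # Deterministically hash a string into one of num_buckets buckets.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def dp_histogram(values, num_buckets, epsilon):
    # Hash non-numerical values into buckets, then add Laplace noise of
    # scale 1 / eps to each bucket count before releasing the histogram.
    counts = np.zeros(num_buckets)
    for v in values:
        counts[bucket(v, num_buckets)] += 1
    return counts + rng.laplace(scale=1.0 / epsilon, size=num_buckets)

cities = ["paris", "tokyo", "paris", "lima", "tokyo", "paris"]
print(np.round(dp_histogram(cities, num_buckets=8, epsilon=1.0), 2))
```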
Differential privacy offers a convenient “middle ground” compared to other privacy-preserving methods such as homomorphic computation, private information retrieval (PIR) and secure multi-party computation (MPC). The comparison below summarizes the tradeoffs related to computational cost, accuracy and privacy. Note that any such ranking is subjective and may vary depending on the specific use case and implementation details.
In terms of computational cost, differential privacy is much more straightforward than its competitors, which require several layers of encryption (homomorphic computation), multiple rounds of communication (PIR and MPC), or the retrieval of additional data (PIR). The privacy guarantees of these other techniques are also stronger: the server does not learn which data is being accessed (PIR), multiple parties compute jointly without revealing their inputs (MPC), or the entire computation runs without decryption (homomorphic computation). Differential privacy, in contrast, provides plausible deniability, which might nonetheless be sufficient for several use cases.
Acknowledgements
I would also like to thank Jovian for hosting a meetup on topics in machine learning. Do check out their platform for exploring related technical skills!
References
[1] Dwork, C., & Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4), 211–407.
[2] Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (pp. 308–318).