Yesterday, underlining the importance of strong privacy protections, Google open-sourced a differential privacy library that it uses in its own core products. The library is an end-to-end implementation of a differentially private query engine, designed to be generic and scalable. In essence, developers can use it to build tools that work with aggregate data without revealing personally identifiable information.
According to Miguel Guevara, product manager of privacy and data protection at Google, differentially private data analysis lets an organization learn from the majority of its data while ensuring that no individual's data can be distinguished or re-identified. He notes that this type of analysis can serve a wide variety of purposes but can be particularly difficult to execute from scratch.
Google's differential privacy library supports differentially private aggregations over databases, even when individuals can each be associated with arbitrarily many rows. The company has been using differential privacy to build supportive features that show “how busy a business is over the course of a day or how popular a particular restaurant’s dish is in Google Maps, and improve Google Fi,” says Guevara in the official blog post.
Google researchers have also published their findings in a research paper. The paper describes a C++ library of ε-differentially private algorithms that can be used to produce aggregate statistics over numeric data sets containing private or sensitive information. The researchers additionally provide a stochastic tester to check the correctness of the algorithms.
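To give a feel for what an ε-differentially private aggregation does, here is a minimal Python sketch of the classic Laplace mechanism applied to a bounded sum. The function names (`laplace_noise`, `dp_sum`) and parameters are illustrative only and are not the actual API of Google's C++ library:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling from a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_sum(values, epsilon, lower, upper):
    """Illustrative epsilon-DP sum: clamp each value to [lower, upper],
    then add Laplace noise scaled to the sensitivity of the sum."""
    clamped = [min(max(v, lower), upper) for v in values]
    # One individual can change the clamped sum by at most this much.
    sensitivity = max(abs(lower), abs(upper))
    return sum(clamped) + laplace_noise(sensitivity / epsilon)
```

Clamping is what makes the sensitivity finite: without known bounds on each contribution, no finite amount of noise could hide a single individual's presence.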
One of the researchers explained the motivation for the library on Twitter: “The main focus of the paper is to explain how to protect *users* with differential privacy, as opposed to individual records. So much of the existing literature implicitly assumes that each user is associated to only one record. It’s rarely true in practice!”
Key features of the differential privacy library
Statistical functions: Developers can use the library to compute Count, Sum, Mean, Variance, Standard deviation, and Order statistics (including min, max, and median).
Rigorous testing: The differential privacy library ships with manual tests and an extensible stochastic testing framework. The stochastic framework generates databases and checks whether the differential privacy predicate holds on the algorithms' outputs. It consists of four components: database generation, search procedure, output generation, and predicate verification. The researchers have also open-sourced the ‘Stochastic Differential Privacy Model Checker library’ for reproducibility.
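The core idea of such a stochastic tester can be sketched as follows: run a mechanism many times on two neighboring databases and verify that no output bin's probability differs by more than a factor of e^ε (up to sampling slack). This is a simplified, hypothetical approximation of the approach, not Google's actual checker:

```python
import math
import random
from collections import Counter

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling from a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def check_dp_predicate(mechanism, db, neighbor, epsilon,
                       trials=50000, min_frac=0.01, slack=2.0):
    """Empirically test the epsilon-DP inequality on ONE pair of
    neighboring databases: for every frequently observed output bin,
    the two output probabilities may differ by at most e^epsilon
    (times a multiplicative slack that absorbs sampling noise)."""
    hist_a = Counter(round(mechanism(db)) for _ in range(trials))
    hist_b = Counter(round(mechanism(neighbor)) for _ in range(trials))
    bound = math.exp(epsilon) * slack
    for out in set(hist_a) | set(hist_b):
        if max(hist_a[out], hist_b[out]) < trials * min_frac:
            continue  # skip rare bins, where sampling noise dominates
        p = (hist_a[out] + 1) / (trials + 1)  # smoothed frequencies
        q = (hist_b[out] + 1) / (trials + 1)
        if p / q > bound or q / p > bound:
            return False
    return True
```

A real tester must also search over many database pairs and handle the tails carefully; passing one pair is evidence of correctness, never a proof.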
Ready to use: The differential privacy library builds on the common Structured Query Language (SQL) extension, which can capture most data analysis tasks based on aggregations.
Modular: The differential privacy library can be extended to include other functionality, such as additional mechanisms, aggregation functions, or privacy budget management. It can also be extended to handle end-to-end user-level differential privacy testing.
How does the differentially private SQL engine work with bounded user contributions?
The Google researchers implemented the differential privacy (DP) query engine as a collection of custom SQL aggregation operators and a query rewriter. The SQL engine tracks user ID metadata to invoke the DP query rewriter, which performs anonymization semantics validation and enforcement.
The query rewriter then transforms queries in two steps. The first step validates the table subqueries; the second samples a fixed number of partially aggregated rows for each user, which limits a user's contribution across partitions. Finally, the system computes a cross-user DP aggregation over the rows contributed to each GROUP BY partition, limiting each user's contribution within partitions. The paper states, “Adjusting query semantics is necessary to ensure that, for each partition, the cross-user aggregations receive only one input row per user.”
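The two steps above can be sketched for a simple per-partition user count. This is a hypothetical, much-simplified Python model of the rewrite described in the paper, not the engine itself; all names and parameters are invented for illustration:

```python
import math
import random
from collections import defaultdict

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling from a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_group_count(rows, epsilon, max_partitions_per_user=1):
    """Sketch of a DP 'COUNT of users per partition GROUP BY partition'.
    rows: iterable of (user_id, partition) pairs."""
    # Step 1: per-user partial aggregation -- collapse each user's rows so
    # a user contributes at most ONE input row per partition.
    per_user = defaultdict(set)
    for user_id, partition in rows:
        per_user[user_id].add(partition)
    # Bound contribution ACROSS partitions by sampling at most
    # max_partitions_per_user partitions per user.
    contributions = defaultdict(int)
    for user_id, partitions in per_user.items():
        kept = random.sample(sorted(partitions),
                             min(len(partitions), max_partitions_per_user))
        for partition in kept:
            contributions[partition] += 1
    # Step 2: cross-user DP aggregation -- each user now changes each
    # partition's count by at most 1 and appears in at most
    # max_partitions_per_user partitions, so that bound is the sensitivity.
    return {p: c + laplace_noise(max_partitions_per_user / epsilon)
            for p, c in contributions.items()}
```

The sampling step is what makes the noise scale finite even when a user owns arbitrarily many rows, which is exactly the user-level (rather than record-level) guarantee the paper emphasizes.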
In this way, the differentially private SQL system captures most data analysis tasks based on aggregations. The mechanisms implemented in the system use the stochastic checker to prevent regressions and strengthen the quality of the privacy guarantees.
Though the algorithms presented in the paper are simple, the researchers maintain that, based on empirical evidence, the approach is useful, robust, and scalable. In the future, they hope to see usability studies that test the success of these methods. They also see room for significant accuracy improvements, for example by using Gaussian noise and better composition theorems.
Many developers have appreciated that Google open sourced its differential privacy library for others.
This is interesting: a new open-source library and tools that make differential privacy easier: https://t.co/TBS8iV96J4
— Matt Cutts (@mattcutts) September 5, 2019
In contrast, many people on Hacker News are not impressed with Google’s initiative and feel that Google is misleading users with this announcement.
One of the comments read, “Fundamentally, Google’s initiative on differential privacy is motivated by a desire to not lose data-based ad targeting while trying to hinder the real solution: Blocking data collection entirely and letting their business fail.
In a world where Google is now hurting content creators and site owners more than it is helping them, I see no reason to help Google via differential privacy when outright blocking tracking data is a viable solution.”