Google open sources method to join datasets without gatecrashing privacy

Google has open sourced a method for secure multi-party computation that it reckons will allow organisations to work on confidential data sets while keeping individuals’ details encrypted – potentially making machine learning less of a privacy nightmare.

Private Join and Compute builds on the principles behind the Password Checkup extension Google released earlier this year, which relies on the private set intersection cryptographic protocol.

It aims to answer the question: “How can one party gain aggregated insights about the other party’s data without either of them learning any information about individuals in the datasets?”

Two technologies are used. The first, private set intersection, allows two parties to join their data sets and discover which identifiers they have in common; Private Join and Compute uses an oblivious variant that only marks the encrypted identifiers, without either party learning the identifiers themselves. This is combined with homomorphic encryption, which “allows certain types of computation to be performed directly on encrypted data without having to decrypt it first, which preserves the privacy of raw data”.
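
For the curious, the oblivious matching trick can be sketched in a few dozen lines of Python. What follows is a toy, Diffie-Hellman-style private set intersection of our own devising, in the same family as the construction Google describes: every name, parameter and key size here is illustrative, and the 128-bit toy group is nowhere near secure.

```python
import hashlib
import secrets

def is_probable_prime(n, rounds=20):
    """Miller-Rabin primality test (stdlib only)."""
    if n < 2:
        return False
    for sp in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % sp == 0:
            return n == sp
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def gen_safe_prime(bits=128):
    """Find p = 2q + 1 with p and q both prime. Toy size: NOT secure."""
    while True:
        q = secrets.randbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(q) and is_probable_prime(2 * q + 1):
            return 2 * q + 1

P = gen_safe_prime()   # group modulus
Q = (P - 1) // 2       # prime order of the quadratic-residue subgroup

def hash_to_group(identifier: str) -> int:
    """Map an identifier into the order-Q subgroup: hash, then square."""
    h = int.from_bytes(hashlib.sha256(identifier.encode()).digest(), "big") % P
    return pow(h, 2, P)

class Party:
    def __init__(self, identifiers):
        self.key = secrets.randbelow(Q - 1) + 1   # secret blinding exponent
        self.ids = identifiers

    def blind_own(self):
        """Send out H(x)^key for each identifier; raw values never leave."""
        return [pow(hash_to_group(x), self.key, P) for x in self.ids]

    def blind_other(self, blinded):
        """Re-blind the other party's values with our own key; returning a
        set hides the order (real protocols shuffle explicitly)."""
        return {pow(v, self.key, P) for v in blinded}

# Because (H(x)^a)^b == (H(x)^b)^a, identifiers held by both parties
# collide after double blinding, revealing only the size of the overlap.
client = Party(["alice@example.com", "bob@example.com", "carol@example.com"])
server = Party(["bob@example.com", "carol@example.com", "dave@example.com"])
double_client = server.blind_other(client.blind_own())
double_server = client.blind_other(server.blind_own())
print("intersection size:", len(double_client & double_server))   # -> 2
```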

Google says: “This combination of techniques ensures that nothing but the size of the joined set and the statistics (e.g. sum) of its associated values is revealed. Individual items are strongly encrypted with random keys throughout and are not available in raw form to the other party or anyone else.”
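
Additively homomorphic schemes such as Paillier are what make that encrypted sum possible. Here is a deliberately tiny Paillier sketch to show the mechanics; the hard-coded Mersenne primes and helper names are ours, purely for illustration, and bear no resemblance to production key sizes.

```python
import math
import secrets

# Toy Paillier keypair from two known Mersenne primes (2^31 - 1, 2^61 - 1).
# Fine for a demo; real deployments use primes of around 1536 bits or more.
p, q = (1 << 31) - 1, (1 << 61) - 1
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                                # lam^-1 mod n (Python 3.8+)

def encrypt(m: int) -> int:
    """Enc(m) = (1 + n)^m * r^n mod n^2, with fresh randomness r."""
    r = secrets.randbelow(n - 1) + 1
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    """Dec(c) = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Additive homomorphism: multiplying ciphertexts adds their plaintexts,
# so a party holding only ciphertexts can still total up the values.
spends = [12, 7, 30]
total = 1                                 # effectively an encryption of zero
for ct in (encrypt(v) for v in spends):
    total = (total * ct) % n2             # accumulate without decrypting
print(decrypt(total))                     # -> 49
```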

Google claims that “two parties can encrypt their identifiers and associated data, and then join them. They can then do certain types of calculations on the overlapping set of data to draw useful information from both datasets in aggregate.” But throughout, identifiers and associated data remain “fully encrypted and unreadable.”
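
Putting the two toy sketches above together gives a feel for the join-then-compute flow the quote describes. The scenario and every name below are hypothetical (it reuses Party, hash_to_group, encrypt, decrypt and the moduli from the earlier sketches), and it shows only the protocol’s shape, not Google’s implementation, which layers on shuffling and other protections.

```python
import random

# Hypothetical roles: an ad platform holds click identifiers, a shop holds
# purchase amounts. Goal: the shop learns only the total spend by clickers.
ad_platform = Party(["alice@example.com", "bob@example.com", "carol@example.com"])
purchases = {"bob@example.com": 12, "carol@example.com": 7, "dave@example.com": 30}
shop = Party(list(purchases))

# 1. The shop ships each row as (singly blinded id, encrypted spend), shuffled.
rows = [(pow(hash_to_group(x), shop.key, P), encrypt(v))
        for x, v in purchases.items()]
random.shuffle(rows)

# 2. The shop also double-blinds the ad platform's own identifiers.
double_clicks = shop.blind_other(ad_platform.blind_own())

# 3. The ad platform re-blinds the shop's ids, keeps the ciphertexts of the
#    rows that match, and multiplies them into one encrypted total it
#    cannot open (only the shop holds the Paillier key).
total = 1
for blinded_id, ct in rows:
    if pow(blinded_id, ad_platform.key, P) in double_clicks:
        total = (total * ct) % n2

# 4. The shop decrypts a single number: spend attributable to ad clickers.
print("intersection-sum:", decrypt(total))   # -> 19 (bob + carol)
```

Note the asymmetry: the ad platform learns only the intersection size, while the shop learns only the aggregate sum, which is the guarantee Google quotes above.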

According to Google’s paper on the topic, the research was prompted by the question of how to compute the aggregate conversion rate (or effectiveness) of advertising campaigns – a subject close to Google’s heart.

Google said it is exploring use cases including user security, obviously, aggregated ads measurement, also obviously, and collaborative machine learning. It postulated applications in areas where there are particular sensitivities about revealing details on individuals represented in data, such as public policy, or tracking diversity and inclusion. And of course, healthcare.

You can access the protocol and the heavyweight paper detailing the work via Google’s security blog.