Blame it on misguided youth, blame it on greed, or blame it on just plain ignorance, but I have a confession to make: I spent the early part of my career working for a major investment bank.
To be honest, it was a great opportunity as a young Data Scientist to learn how to apply all the theory I’d learned in mathematics, statistics and computer science to ‘real world’ problems.
I learned a great deal about working with data, working with people, and how to survive long hours as a front office quant on a busy trading floor on coffee and free food…
One thing I also learned was the importance of one particular day in the year to an investment banker: bonus day!
This day is filled with both excitement and dread. Initially, you're excited to have a number revealed to you: one you've been trying to guess, and one that can make up a decent portion of your remuneration. This is closely followed by anxiety and dread as you start to wonder how you rate compared to the rest of your team, and to the rest of the organisation.
The issue was that everyone wanted to know what everyone else got, but no one wanted to reveal their own amount!
My colleagues and I used to discuss how great it would be if there was a way to measure how we ranked, but without disclosing our individual numbers.
Well, now there’s a way…
Show Me The Money!
Let’s consider four employees, Alice, Bob, Jane and Sarah, who are all at the same level, potentially from different offices that are located in different cities, and who want to know what the average bonus is amongst them, but without disclosing their individual amounts.
As they’re a paranoid group, intent on preserving their privacy, we’ll assume an independent colleague, Pete, is responsible for calculating the actual average.
To ensure privacy, Pete won’t be privy to either individual names or individual bonus amounts.
So how can we do this?
The diagram below illustrates the key steps in the process:
To begin with, we apply Privacy-Preserving Data Matching (PPDM) to the names of the individuals in order to link up staff that are part of the same team, and at the same level, across the organisation. This ensures that their names are ‘masked’ so that Pete can’t identify them.
But this only solves part of the privacy problem. Even though Pete can no longer see the raw names, he can still see the unmasked bonuses attached to each masked name. There are ways he could uncover who got what: by using a dictionary attack, for instance, or simply common sense. Say Sarah was the most senior and experienced member of the team, and had contributed the most to its success. Pete could easily presume that the highest bonus belongs to her.
To completely ensure privacy, we must also hide the bonus amounts. One way to do this is via Homomorphic Encryption (HE). This differs from normal encryption in that it has one special, magical feature: it allows us to do computations over encrypted values!
This means that Pete can now calculate the average of the encrypted bonus values as he would on the raw values. Once he decrypts this value, the actual average is revealed, which can then be shared with the four anxious team members, some of whom will walk away smiling and others crying…
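Under an additively homomorphic scheme (Paillier, say), this averaging step is just arithmetic on ciphertexts: Pete computes \( \mathrm{Enc}(b_{A}) \cdot \mathrm{Enc}(b_{B}) \cdot \mathrm{Enc}(b_{J}) \cdot \mathrm{Enc}(b_{S}) = \mathrm{Enc}(b_{A} + b_{B} + b_{J} + b_{S}) \), decrypts the result, and divides by four to obtain the average, without ever seeing an individual bonus \(b_i\). (In a real deployment, the decryption key would be held separately from the individual ciphertexts, so that no single party could decrypt an individual bonus.)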
This is all possible due to the concept of ‘Confidential Computing’, a term coined by Data61. We’ll now discuss in more detail the two important elements that make it possible, i.e. PPDM and HE.
Privacy-Preserving Data Matching
PPDM is simply a way to mask the data before linking it. Think of it as a fingerprint for your data, i.e. a unique identifier. It is a powerful technique that scales to matching across multiple databases.
Hash functions are typically used. These are one-way functions that mathematically map data of any size to a fixed-size output.
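As a minimal sketch (the shared key and the normalisation rules are illustrative assumptions, not part of any particular PPDM product), each party could compute a keyed hash of every name, so that identical names produce identical masks while Pete only ever sees the masks:

```python
import hashlib
import hmac

# Assumption for illustration: the colleagues share this key; Pete never sees it.
SHARED_KEY = b"team-secret-key"

def mask_name(name: str) -> str:
    """Return a fixed-size, one-way 'fingerprint' of a normalised name."""
    normalised = name.strip().lower().encode("utf-8")
    return hmac.new(SHARED_KEY, normalised, hashlib.sha256).hexdigest()

# The same name always yields the same mask, so records can still be linked...
assert mask_name("Alice") == mask_name(" alice ")
# ...but different names yield different masks, and the hash cannot be reversed.
assert mask_name("Alice") != mask_name("Bob")
```

Using a keyed hash (HMAC) rather than a plain hash matters here: without the key, Pete could mount the dictionary attack described earlier simply by hashing a list of likely names and comparing.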
Image courtesy of Stephen Hardy.
Homomorphic Encryption
HE is an incredibly powerful encryption method that allows us to perform mathematical operations on encrypted data. As an example, if we encrypt two numbers, as shown below, then the decrypted sum will reveal the correct summed result. See this primer for an illustrated discussion.
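To make this concrete, here is a toy implementation of the Paillier cryptosystem, a classic additively homomorphic scheme. The tiny hard-coded primes are for illustration only; a real deployment would use a vetted library and keys of 2048 bits or more.

```python
import math
import random

def keygen(p: int = 10007, q: int = 10009):
    """Generate a toy Paillier key pair from two (insecurely small) primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse of lam (Python 3.8+)
    return n, (lam, mu, n)

def encrypt(n: int, m: int) -> int:
    """Encrypt m (0 <= m < n) under public key n, using generator g = n + 1."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:  # r must be invertible modulo n
        r = random.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, c: int) -> int:
    """Recover the plaintext from a ciphertext using the private key."""
    lam, mu, n = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n) * mu % n

# Multiplying ciphertexts adds the underlying plaintexts.
pub, priv = keygen()
c1, c2 = encrypt(pub, 123), encrypt(pub, 456)
assert decrypt(priv, (c1 * c2) % (pub * pub)) == 579
```

The final assertion is exactly the property the diagram shows: two numbers are encrypted separately, the ciphertexts are combined, and decrypting the combination reveals the correct sum.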
Image courtesy of Stephen Hardy.
HE is typically split into either Fully Homomorphic Encryption (FHE) or Partially Homomorphic Encryption (PHE).
FHE allows us to perform both addition and multiplication, but at the expense of computational feasibility. The reason is that each addition and multiplication adds a bit of noise to the resulting ciphertext. PHE is a compromise: we achieve computational feasibility, but can only perform restricted operations, i.e. either the addition of two encrypted values, or the multiplication of an encrypted value by an unencrypted scalar.
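To sketch what this means for an additively homomorphic PHE scheme such as Paillier (where all arithmetic is modulo \(n^2\)): multiplying two ciphertexts yields an encryption of the sum, \( \mathrm{Enc}(m_1) \cdot \mathrm{Enc}(m_2) = \mathrm{Enc}(m_1 + m_2) \), and raising a ciphertext to an unencrypted scalar \(k\) yields an encryption of the product, \( \mathrm{Enc}(m)^{k} = \mathrm{Enc}(k \cdot m) \). What PHE cannot do is multiply two encrypted values together; that is precisely the operation FHE adds.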
There also exists a middle ground between the two, known as Somewhat Homomorphic Encryption (SHE), which supports only a limited number of homomorphic operations.
Data Privacy vs Increased Security
The importance and opportunities provided by such a capability are far reaching and somewhat profound.
Across industry and government, there is a need to ensure citizen privacy whilst maintaining national security through the use of, and access to, their data. Confidential computing provides a way for us to use the data without seeing it!
There are numerous examples of real-world applications for such techniques, such as cyber security, health and detecting financial crime and fraud.
Detecting, Deterring and Disrupting Financial Crime
The detection of financial crime and fraud is particularly interesting as it poses some challenging problems and opportunities. It’s been estimated that each year, US$2.4 trillion in proceeds from criminal activities flows through the global financial system, but that less than \(1\%\) is detected.
Complex financial crimes typically involve transactions over multiple payment channels and geographies, involving many different financial institutions and parties, some of which are unregulated. These institutions operate in silos, with each player seeing only one piece of the overall puzzle. Complex schemes can involve money laundering, terrorism financing, tax evasion and drug trafficking.
Confidential computing allows us to tackle such problems, by increasing surveillance whilst maintaining individual privacy.
To do this, we can apply the aforementioned principles of PPDM and HE in a distributed manner, as illustrated below:
Image courtesy of Kee Siong Ng.
The aim is to find a common set of entities of interest (EOI) to all parties, so they can collectively risk score them, for instance.
The parties can include Financial Intelligence Units (FIU) from around the world, various Reporting Entities (RE), such as banks, money remitters and casinos, Law Enforcement Agencies (LEA) and Partner Agencies (PA).
Each party wants to maintain the privacy of the entities they hold, but they collectively want to share information in order to build a risk scoring model.
To do so, they must first match data to determine the subset of EOIs that are of common interest, using PPDM, and then they need to use HE to encrypt the underlying data in such a way that still allows them to perform mathematical operations.
To further ensure privacy, and minimise data transfer, the data for each party remains distributed, stored across different databases, and potentially located anywhere in the world.
For a more detailed discussion of how to apply Confidential Computing to detecting financial crime, here are some practical algorithms for distributed privacy-preserving risk modelling.
The takeaway message is this:
We can now analyse and use data without seeing it!