# In Bloom – Nirvana?

I recently discussed how Confidential Computing allows us to analyse and use data without actually seeing it.

A key component of Confidential Computing is Privacy-Preserving Record Linkage (PPRL), and one of the most widely used techniques within PPRL is the Bloom Filter (BF), which is the focus of this post.

The aim is to give an overview of BFs, the role they play in PPRL and whether or not they’re as secure as many believe.

But before we dive into the world of BFs, we must first discuss the beauty of Hashing…

###### #Hash Functions

Cryptographic Hash Functions are one-way functions (mathematical algorithms) that map an input string, of arbitrary size, to (typically) a smaller and fixed size output (hash value). Basically, they’re a ‘fingerprint’ of the input data, as they uniquely identify it.

Being one-way means that they are practically impossible to invert, ie you can’t retrieve the original value given the hashed value. In addition, other properties that a hash function must meet include:

1. Fast,
2. Uniformly distributed output,
3. Deterministic ie the same input always maps to the same output,
4. Unique ie each hash value only corresponds to one input value, and
5. Tweaking the input generates a totally different hash value to the unmodified input.

Hash functions, in general, are used widely in such areas as indexing databases (to speed up search) and password verification.

Given that they produce a fixed-length output, hash ‘collisions’ are technically possible ie when two different inputs produce the same hash value. One possible way around this issue is to use hash algorithms that produce longer hash values.

Numerous cryptographic hash algorithms exist, and some of the most popular are SHA and MD5. SHA-256, for instance, produces a 256-bit hash, thus making it very unlikely for collisions to occur.

As an example, here’s the hash generated for https://impartiallyderivative.com/ by SHA-256:

2817df2e0bb1dd03c2d517a8affdcaf2735b011e2f11cbf39267403edeaa259a

Another application of cryptographic hash functions is in the wonderful world of Bloom Filters…

###### Blooming Marvellous

The Bloom Filter was invented by Burton Howard Bloom in 1970. It was basically designed to efficiently represent sets, and to test membership of an item in a set. Structurally, it’s simply a bit array.

They are based on hash functions, and hence are probabilistic data structures, with the important property of being space efficient.

There are two operations you can perform on a BF:

1. Add an element to the set, and
2. Test whether an element is contained within the set.

In regards to querying the set, BFs are unique in that:

• False Negatives are not possible – the query returns whether an element is definitely not in the set, but
• False Positives are possible – the query returns whether an element is only possibly in the set (hence, being probabilistic).

So, how do they work?

The basic algorithm is as follows:

• Initialise the BF, a bit array of $$m$$ bits, to $$0$$.
• Define $$k$$ different hash functions.
• To add an element to the BF, simply pass it to each hash function, which maps it to a single element of the bit array and sets the value to $$1$$.
• To query an element, ie to test if it’s contained in the BF, first pass it to each hash function, and if any of the bit array positions are set to $$0$$, then the element cannot be contained in the BF (otherwise all the bits would be set to $$1$$).

Here’s a simple example: consider a BF of size $$10$$, consisting of two hash functions, $$k_{1}$$ and $$k_{2}$$.

We begin by initialising the BF to 0:

We then wish to add an element to the BF, say “horse”. To do this, we pass it to both hash functions, which return the following bit positions:

$$k_{1} = 3$$ and $$k_{2} = 7$$.

Our BF now looks like this:

We next wish to add another element, say “chicken”. The two hash functions return the following bit positions:

$$k_{1} = 4$$ and $$k_{2} = 9$$.

Our BF now looks like this:

Now, let’s query the BF. Say we want to check if “dog” is contained. We first need to determine the bit positions, from our two hash functions, and check if they’re set to $$1$$. The two hash functions return the below bit positions for “dog”:

$$k_{1} = 1$$ and $$k_{2} = 5$$.

As we can see, these positions are still $$0$$, so we can conclusively say that “dog” is NOT contained.

But what if we query “goat”, which hashes to the following positions?

$$k_{1} = 3$$ and $$k_{2} = 9$$.

As we can see, the query will return Yes, as both positions are set to $$1$$, even though “goat” is not contained in the BF. This is an example of a ‘collision’ (shown in red) and has resulted in a False Positive (FP)!

So, what can we do about dealing with collisions in BFs?

To control the FP rate, we need to control the size of the BF, but there are compromises…

• Increasing the size of the BF will decrease the FP rate, but will slow it down.
• Decreasing the size of the BF will make it faster, but increase the FP rate.

We can also adjust the number of hash functions we use.

• Too many hash functions make it slower, as it fills up quickly.
• Too few hash functions are faster, but also increase the FP rate.

Fortunately, there’s a formula that represents the relationship:

$$p \approx \left( 1 – e^{\frac{-kn}{m}} \right)^{k}$$

where $$p$$ is the error (FP) rate, $$k$$ is the number of hash functions, $$m$$ is the size of the BF and $$n$$ is the number of elements inserted.

The optimal number of hash functions, $$k$$, can be shown to equal:

$$k = \frac{m}{n} \ln 2$$

You may have noticed that one thing we haven’t discussed is what if we want to remove an element from a BF?

Well, we can’t.

The reason is that False Negatives are not permitted.

Consider for instance if we wanted to remove “goat” from our above example, which hashed to positions $$3$$ and $$9$$. We could simply set these values back to $$0$$ correct?

Well, not quite.

Remember that “horse” mapped to position $$3$$ and “chicken” to position $$9$$, so this would mean that we’d be deleting them too!

However, there are structures that contain all the best parts of a BF but also allow removal: Counting Bloom Filters.

With Counting Bloom Filters (CBF), instead of storing single bit values, we instead store integer values ie a CBF is an integer vector.

So, instead of setting the position to $$1$$ for an element, we instead increase the previous value by $$1$$. To query a value, we simply test if the position value is greater than $$0$$. To remove an element, we simply decrease the position value by $$1$$.

The trade-off with CBFs vs BFs is increased size.

We now know a bit about BFs, and how they’re based on hash functions. So how do they help with Privacy-Preserving Record Linkage?

###### Blossoming Privacy

The aim of PPRL is to find records that represent the same entity in separate data stores, most commonly databases, without revealing the identity of the entities. They rely on balancing linkage quality with privacy.

In practice, this is achieved by calculating the similarity between two records. As it happens, BFs are a popular technique to do this, as we can easily calculate the similarity between BF encoded values, in either string or numerical format.

There are numerous methods for encoding records into BFs, but the general process is as follows:

• Consider a string “gerhard” that we wish to encode.
• We begin by generating a set of $$\mathbf{q}$$-grams ie sub-strings of length $$q$$, determined from our given string by a sliding window approach. For instance, the $$2$$-grams (bigrams) for “gerhard” are: “ge”, “er”, “rh”, “ha”, “ar” and “rd”.
• We next hash each of these $$q$$-grams into a BF.
• In order to calculate the similarity (ie degree of match) between two BFs, the Dice coefficient is commonly used, and is defined as:

$$sim_{D}(\mathbf{b_{1}}, \mathbf{b_{2}}) = \frac{2 c}{x_{1} + x_{2}}$$

where $$c$$ is the number of common bit positions set to $$1$$ in both BFs, and $$x_{1}$$ and $$x_{2}$$ are the number of bit positions set to $$1$$ in the respective BFs, ie $$\mathbf{b_{1}}$$ and $$\mathbf{b_{2}}$$.

Let’s now see how this works by the following example:

Consider two names, “gerhard” and “gerrard” that we wish to encode in BFs, and then compare their similarity to see how well they match. Here’s the process, with an illustration below (with collisions indicated in red):

• We begin by splitting the two names into bigrams, for instance:
• “gerhard” = “ge”, “er”, “rh”, “ha”, “ar”, “rd”; and
• “gerrard” = “ge”, “er”, “rr”, “ra”, “ar”, “rd”.
• We next encode these bigrams into BFs using two hash functions, for instance.
• Finally, we calculate the Dice coefficient similarity between the two names:

$$sim_{D}(\mathbf{b_{1}}, \mathbf{b_{2}}) = \frac{2 \times 8}{10 + 9} = 0.84$$

The degree of the match is strong, as expected, given the high number of common bigrams between the two BFs.

As an aside, remember CBFs… Well, they can be used for matching across multiple databases.

###### It’s Not All Roses

Even though BFs are currently being used in practical applications of PPRL, and offer some true benefits, they do have their limits. Most notably, they are vulnerable to attacks!

A common set of attacks, known as frequency attacks, aim to re-identify encoded values from sets of BFs by exploiting the fact that the frequencies of values in an encoded database can be matched to frequent plaint-text values, such as names. These attacks simply use the Hamming weight, ie the number of 1-bits of a BF, to define the frequencies.

In order to improve the security of BFs, Hardening techniques are used to try thwart cryptanalysis attacks. Two common examples include Balancing and Salting.

###### Balancing

These techniques are based on the fact that attacks on balanced BFs, ie BFs with constant Hamming weights, are harder to attack.

One approach is to concatenate a BF , of size $$m$$ say, with a negated copy of its self (ie flipping all the bits, so $$1$$‘s become $$0$$‘s and vice versa), thus resulting in a new BF of length $$2m$$. Privacy is further improved by then permuting all the $$2m$$ bits. The resulting BF will have a uniform Hamming weight as now half of the bits are set to $$1$$. The compromise is that memory requirements and computing time are increased.

Let’s illustrate this via an example. Consider the below BF, of length $$4$$:

To balance this BF, we  want to concatenate it with a negated copy of itself, ie

The resulting BF, of length $$8$$ looks like this:

Finally, the resulting BF is randomly permuted, and for example, now looks like this:

###### Salting

These techniques add extra string values to each $$q$$-gram before hashing. The idea is to reduce the frequency relative to unsalted values.

For example, let’s say we wanted to hash the names of two people, who both happen to be called “fred”. If we just went ahead, and generated $$q$$-grams as normal, they’d both be hashed to the same bit positions by the BF encoding process. What we can do instead, for instance, is add the salting value of Age to each $$q$$-gram before hashing, so that they’ll be hashed to different bit positions instead.

For instance, the bigrams of two “fred’s”, with different ages, may look something like this:

Fred 1: “fr27”, “re27” and “ed27”.

Fred 2: “fr35”, “re35” and “ed35”.

This results in the frequency of “fred’s” bigrams being reduced, hence helping thwart frequency attacks on bigrams that may otherwise frequently occur.

Hardening is an interesting and ever-growing area of research in the field of BFs, as the limits of their security are tested…