In one of my previous posts on Bloom Filters, I stated the expressions for both the False Positive rate, ie

$$p \approx \left( 1 – e^{\frac{-kn}{m}} \right)^{k}$$

and the optimal number of hash functions, ie

$$k = \frac{m}{n} \ln 2$$

In this post I will detail the derivations of both expressions.

**Derivation of the False Positive Rate**

**Derivation of the False Positive Rate**

Our aim is to insert elements from a set \(S\) with \(n\) elements, ie \(S = \{s_{1}, s_{2}, …, s_{n}\}\), into a Bloom Filter \(B\), with \(m\) bits, using \(k\) hash functions, \(h\).

Now, the probability that a hash element in \(B\) has been set to \(1\), assuming independent and random hash functions, is simply \(\frac{1}{m}\). Hence, the probability that a bit has NOT been set to \(1\) is \(\left(1 – \frac{1}{m}\right)\). Thus, the probability that a bit element has NOT been set to \(1\) by any of the \(k\) hash functions is

$$\left(1 – \frac{1}{m}\right)^{k}$$

After all \(n\) elements of \(S\) have been inserted into \(B\), the probability that a specific bit is still \(0\) is

$$p_{0} = \left(1 – \frac{1}{m}\right)^{kn}$$

The False Positive rate, ie that a specific set of bit elements is \(1\), is

$$p = \left(1 – \left(1 – \frac{1}{m}\right)^{kn}\right)^{k}$$

However, we can simplify this expression by using the following definition

$$\lim_{n \rightarrow \infty} \left(1 – \frac{1}{n}\right)^{n} \approx e^{-1}$$

and along with a bit of simple mathematical magic, we get

$$\left(1 – \frac{1}{m}\right)^{kn} = \left(1 – \frac{1}{m}\right)^{m\frac{kn}{m}} \approx e^{-1.\frac{kn}{m}} \approx e^{-\frac{kn}{m}}$$

Therefore,

$$p \approx \left(1 – e^{-\frac{kn}{m}}\right)^{k}$$

QED.

**Derivation of the optimal number of hash functions \(k\)**

**Derivation of the optimal number of hash functions \(k\)**

Our aim is to find the optimal number of hash functions \(k\) that minimise the False Positive error \(p\), as defined above.

To begin with, and to simplify our expressions (ie we won’t want messy looking \(e\)‘s and exponents floating around if we don’t have to), let \(u = e^{-\frac{kn}{m}}\).

The False Positive rate is then

$$p = \left(1 – u\right)^{k}$$

Now, let’s use some simple and fundamental, yet powerful, properties of exponentials and logarithms to express \(p\) slightly differently.

Firstly, given \(a = e^{\ln a}\), we get

$$p = \left(1 – u\right)^{k} \equiv e^{\ln \left(1 – u\right)^{k}}$$

Secondly, given \(\ln a^y = y \ln a\), we get

$$p = e^{k \ln \left(1 – u\right)}$$

The reason we bothered to go through all that is because it helps us with what comes next.

As mentioned above, our aim is to determine the optimal value for \(k\) by minimising \(p\). The important point to note is that minimising \(p\) is equivalent to minimising \(g = k \left(1 – u\right)\), ie the exponent of \(e\) in the expression for \(p\), which we’ve labelled \(g\) for convenience.

But why is this of importance? To answer this, involves knowing what comes next. To minimise the function \(g\) with respect to \(k\), is equivalent to determining the value of \(k\) for where the first derivative of \(g\) equals \(0\). Trust me, and if you don’t believe me, it’s time to refresh your first year calculus ðŸ™‚

Now, when it comes to differentiation, especially with respect to only one variable, differentiating exponentials and natural logarithms makes things much easier given their relatively simple differential properties.

So, as stated, the local minimum for \(g\) is the value of \(k\) for which

$$\frac{d g}{d k} = 0$$

Let’s go ahead and do some fun calculus, remembering that we previously let \(u = e^{-\frac{kn}{m}}\):

\begin{array}{rcll} \frac{d g}{d k} & = & \ln (1 – u) + k \left(\frac{-\frac{d u}{d k}}{(1 – u)}\right) & \text{via the product rule} \\& = &\ln(1 – u) + \frac{-k}{(1-u)}\frac{d}{d k}e^{-\frac{kn}{m}} & \\& = & \ln(1-u) + \frac{-k}{(1 – u)}e^{-\frac{kn}{m}}\frac{d}{d k} \left(-\frac{kn}{m}\right) & \\& = & \ln(1 – u) + \frac{-k}{(1 – u)}e^{-\frac{kn}{m}}\frac{-n}{m} & \\& = & \ln(1 – u) + \frac{kn}{m}\frac{e^{-\frac{kn}{m}}}{(1 – u)} & \\& = & \ln(1 – u) + \frac{kn}{m} \frac{u}{(1 – u)} & \end{array}

Now, let’s once again make the most of some simple exponential and logarithmic properties:

\begin{array}{rrcl} & u & = & e^{-\frac{kn}{m}} \\ \Rightarrow & \ln u & = & -\frac{kn}{m} \\ \Rightarrow & \frac{kn}{m} & = & -\ln u \end{array}

Therefore,

$$\frac{d g}{d k} = \ln(1 – u) – \ln u \frac{u}{(1 – u)}$$

As you can see, what we’ve done is just express \(\frac{d g}{d k}\) only in terms of \(u\), by eliminating the \(k\), \(n\) and \(m\) terms. This simplifies the expression a great deal.

We finally need to find the value for \(k\) for which this expression equates to \(0\):

\begin{array}{rrcl} & \frac{d g}{d k} & = & 0 \\ \Rightarrow & 0 & = & \ln (1 – u) -\ln u \frac{u}{(1 – u)} \\ \Rightarrow & \ln u \frac{u}{(1 – u)} & = & \ln(1 – u) \\ \Rightarrow & u \ln u & = & (1 – u) \ln (1 – u) \end{array}

Via symmetry, we can see that the only possible solution for \(u\) is \(\frac{1}{2}\).

Thus, using the property that \(\ln e^{a} = a\), along with the fact that \(\ln 1 = 0\):

\begin{array}{rrcl} & e^{-\frac{kn}{m}} & = & \frac{1}{2} \\ \Rightarrow & -\frac{kn}{m} & = & \ln \frac{1}{2} \\ \Rightarrow & -\frac{kn}{m} & = & \ln 1 – \ln 2 \\ \Rightarrow & -\frac{kn}{m} & = & -\ln 2 \\ \Rightarrow & \frac{kn}{m} & = & \ln 2 \\ \Rightarrow & \frac{kn}{m} & = & \ln 2 \end{array}

Therefore,

$$k = \frac{m}{n} \ln 2$$

QED.