Bloom filters

Warning: This blogpost has been posted over two years ago. That is a long time in development-world! The story here may not be relevant, complete or secure. Code might not be complete or obsoleted, and even my current vision might have (completely) changed on the subject. So please do read further, but use it with caution.
Posted on 09 Apr 2012
Tagged with:

In a span of two months or so, I’ve noticed a peak in implementation of bloom filters. Maybe the “if you got a hammer, everything looks like a nail” applies here, but statistically I’m doing a larger number of bloom filter implementation as usual.  Yet, most of my co-workers never really heard of bloom filters, and I’m continuously need to explain what they are, what their purpose is and why it’s a better solution than other ones. So let’s do an introduction on bloom filters.

Defining the problem

So we already know the solution will be a Bloom Filter, but what is the problem they can solve? Suppose you have a book which obviously contains a lot of words. My problem: I want to know if a certain word is actually inside the book. I don’t care where it is, or how many times it’s inside the book: I just want to know if this book contains this specific word or not. As a good PHP developer, you know PHP has got a fair amount of array functionality, so let’s try and use those:

$words = array();
$book = file_get_contents("book.txt");
foreach (split(" ",$book) as $word) {
  $words[strtolower($word)] = 1;
}

if (array_key_exists("supercalifragilisticexpialidocious", $words)) {
  print "Yes, this word is used in the book";
} else{
  print "No, this book is probably not a Mary Poppins book";
}

Now this is a perfectly valid solution: we store each word of the book as the key of the array so when we encounter a word we already have added, it will not be added twice. However, depending on the size of the book, this method is quite memory consuming. Luckily the size of the array doesn’t really affect the time to look for a word. Or to put it differently: the time it takes to check if a word exists is probably the same for an array with 10 words, as it is with an array with 10 million words (so called O(1) or constant time operations).

But still we are using a lot of memory, there must be better structures and there are: the bloom filter.

Introducing the bloom filter

Bloom filters have the property of being exceptionally fast AND exceptionally small compared to other structures but it comes with a price: it MIGHT be possible that our bloom filter thinks that an element is inside our set, when it really isn’t. Luckily, the reverse is not possible: when a bloom filter says something is NOT in the set, you are 100% sure that it isn’t part of the set.

Now I know most developers hate the word “maybe”, since you cannot really convert the term “maybe” into binary logic unless you start using fuzzy logic, so how does this help us? If we go back to our book-example, we could ask the bloom-filter if a word is inside the book. It will return with: no or maybe. When the answer is “no”, we know for a fact that the word is not in this book, and we could for instance, check the next book until we hit a “maybe” result. In this case, we need to check the book word for word in order to verify that the word is actually used.

This does speed up the search process a lot: first of all, we can store many bloom-filters in memory so we can scan very fast if words are definitely NOT present inside a range of books. When we do find a “maybe”, all we need to do is to get a the array of that specific book so we can find out for sure if the word is present (assuming you need 100% certainty, otherwise you don’t even need to do this). Our gain is in the fact that A) we can store many more bloom filters, and thus search more books in a smaller amount of time and B) we only need to verify the books where we actually might have a hit.

This is also a perfect filter for unnecessary disk access: suppose we store keys onto disk (or another slow medium like a database), where we need these keys on occasion. Now, first we ask the bloom filter if the key is present. If it says “no”, we know the key isn’t there so we don’t need to check any further. However, when it says “maybe”, we will fetch the key from that other medium (database or disk). If it’s there, we can return this info, but when we have “false positive”, (bloom filter said: maybe, but in the end, it wasn’t there), then we just have done a unnecessary read to that medium. However, we only do this on occasion, and not for ALL our requests. In the end, bloom filters are a great way to filter out unwanted slow access request.

How does it work

How can a data structure tell us either “no” or “maybe”. Can’t it just tell use “yes” or “no”? Well.,.. no.. :)  Let’s take a look how a bloom filter is made out of:

We have a bit-array of “m” bits. Let’s say, m = 20. Each of these bits are set to 0.

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-----------------------------------------
                    1 1 1 1 1 1 1 1 1 1 2
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0

Now, we have a number of hash functions. This amount is called “k”. A hash function is a function that turns a block of data into a number, in our case a number between 0 and m. These hash functions are pretty similar to md5, crc32 and sha1, but are much faster. Now, what happens is that when we add data to our bloom filter, this data gets hashed by all the hashes in the bloom filter (so k times). Let’s assume the following:

  • data: “key1”
  • k1(key1) = 4
  • k2(key1) = 3
  • k3(key1) = 15

This means: our data “key1” is hashed to 4, 3 and 15 according to our hashes. Now, the bloom filter will set these bits in our array to 1.

0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
-----------------------------------------
                    1 1 1 1 1 1 1 1 1 1 2
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0

Let’s add another key:

  • data: “another key”
  • k1(key1) = 8
  • k2(key1) = 19
  • k3(key1) = 3

And of course, add these numbers to our array as well:

0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0
-----------------------------------------
                    1 1 1 1 1 1 1 1 1 1 2
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0

You see that the “3” was already set, so it stays set to 1. Now, let’s ask our bloom filter if “somekey” is present. We start the same way as adding data: we use the same hash functions:

  • data: “some key”
  • k1(key1) = 1
  • k2(key1) = 10
  • k3(key1) = 19

Now we search the bloom filter and find if ALL of these bits are set to 1. In this case, only bit 19 is set, but bit 1 and bit 10 aren’t. We can say for 100% certainty that “some key” is NOT in our bloom filter.

But let’s try with “key1”. We already know that the hashes will be 4, 3 and 15. And all these bits are set inside our bloom filter. Now, you might be tempted to think that we can just return a “yes”, but there is a problem with this. Let’s try another key:

  • data: “unknown key”
  • k1(key1) = 3
  • k2(key1) = 19
  • k3(key1) = 4

If we are checking for this key in our filter, we actually see that bits 3, 4 AND 19 are all set, yet we never have added this key to the bloom filter! This is because these hash all return numbers that have previously been set by other keys and this is where the “maybe” comes from. The bloom filter cannot know if these bits happen to be there because you’ve added the key to the set, or that by coincidence all bits happened to be set by other keys.

Error rates

This means that the more hash functions you use, the less chance of errors will be. Also, the more elements we add to the bloom filter, the higher the chance of errors. There are some mathematical formula’s available which you can calculate the chance of errors given the number of elements inside a bloom filter, but most implementations of bloom filters just give you 2 options: how many elements are you going to store, and what is the error rate you need. With these two variables, these implementations can calculate how many hash functions it needs, and/or how large the bit array should be.  There is actually a nice calculator that shows you how this works.

There are other things to take care of: obviously: when ALL the bits of a bloom filter are set, it will ALWAYS return “maybe”. Some bloom implementations can also return the probability of the maybe so you can assume it’s a “yes” when it has a 90%+ probability, and do another check if it’s lower for instance. Again, this all depends on your implementation and needs.

Conclusion

For PHP, there is a really nice and simple PECL extension called bloomy which you can create large bloom filters quite easily, but better yet: why not try and create a bloom filter in PHP yourself! Bloom filters by itself can be very useful for large sets where you only need to determine presence. It’s not a key-value store so you probably need other structures as well, but simple bloom filters can be a really good way to speed up your performance and decrease your memory usage.