Sunday, January 15, 2012

Let's make a Bloom filter

Bloom filters are a simple yet brilliant data structure, and once you learn about them, you start seeing applications for them everywhere. I'm going to quickly go over the absolute basics of what a Bloom filter is and how it works; for much more detail, the Wikipedia page is good.

A Bloom filter is a set data structure for answering the question "Is this thing in the set?" It has two basic operations:

  1. BloomFilter.add(x) adds x to the set.

  2. BloomFilter.has(x) returns either false, if x is definitely not in the set, or true if x is probably in the set.

Notice that a Bloom filter can't tell you that something is definitely an element of the set. There is always a possibility that it will return true for something that actually isn't in the set. This false positive rate can be made arbitrarily low, but it will always be there.

In exchange for this small probability of false positives, a Bloom filter can be much smaller than storing the set itself. For example, suppose you want to store a set of a million email addresses, and each one is 16 bytes long, on average. That's 16 MB of strings, right there. If you're willing to accept a 0.1% false positive rate, though, you can make a Bloom filter for those email addresses that's only 1.8 MB! The space savings are even bigger if you're not too picky about the false positive rates.

There are all sorts of clever things you can do with such a splendidly space-efficient set data structure, but I won't go into them here. Instead, let's just talk about how to make one.

A Bloom filter, internally, is just an array of bits. They start out as all zeros. Here's a picture of a really small Bloom filter, which will start out empty:

empty bloom filter

Let's add the strings "hello" and "world" to it. To add an item to a Bloom filter, we need several different hash values for it. For our little toy example here, let's say we have three hash functions, creatively named hash1(x), hash2(x), and hash3(x). If we hash "hello" with these functions, we get some numbers; let's say 1, 9, and 12. We can think of this as like a fingerprint for "hello"; to add "hello" to the Bloom filter, we set bits number 1, 9, and 12 of the Bloom filter to 1.

hello fingerprint

We repeat that process with "world", getting a different fingerprint (this time, it's 2, 6, and 14):

world fingerprint

We combine these fingerprints with a bitwise OR to get the Bloom filter containing "hello" and "world":

bloom filter combination

So, that's how we add to a Bloom filter. How do we look things up?

We know that if some key has been added to a Bloom filter, then all the bits of its fingerprint will be 1. So we hash the key, and look at the corresponding bits of the Bloom filter. If they're all 1, then we say that the key is probably in the set. If any of them is 0, then we know for sure that the key is not in the set.

Let's code this up!

First, we need a way to get several independent hash values for a key. This method is not mathematically ideal, but this is toy code, and it's short:

Does it work? Let's give it a try, generating 15 hashes of "rage nostrils" modulo 10:

>>> hashes('rage nostrils', 15, 10)
[2, 7, 1, 6, 0, 7, 7, 6, 1, 4, 2, 5, 9, 8, 4]

It works! Now let's code up the most breathtakingly inefficient Bloom filter implementation ever:

You can try it out at home, maybe have a look at that array of bits after adding a few things to the Bloom filter. Here's how I played with it:

Where to go from here

If you want to use Bloom filters for something, there are libraries for it. In Python, PyBloom just works. It's not the fastest, but if you're using Python, you probably don't care. For Java and other JVM languages, there's the simple java-bloomfilter, or the more featureful greplin-bloom-filter. Haskell has the excellent bloomfilter library, written by one of the authors of the chapter on writing Bloom filters in Real World Haskell. Most other languages have something, if you google for it.

And that's it, really. There are a lot of fancy things you can do with Bloom filters, and a lot of terribly clever things you can do to make them faster or more compact, but the basic theory of operations is really, really simple.

Sunday, July 24, 2011

Port of Murmur3 hash to C

Murmur3 is a non-cryptographic hash function, stupendously fast, with excellent mathematical properties. It's just the thing to use if you want to make a hash table, or a Bloom filter, or do any number of other things that require fast, high-quality hashing. Unfortunately, the original code was in C++ and designed mostly for Microsoft Visual Studio, making it not the tool of choice for those of us who would prefer C code, compiled with gcc or something similar. So I ported the code to C and cleared out the MSVC-specific parts.

It's public domain, so do whatever you want with it. There's just a single C file and a short header file, so it should be really easy to incorporate into your own projects. See the readme file on GitHub for more details, or the example program to see how it's used.

Happy hashing!