In many software engineering problems, we have a set and need to determine if some value belongs to this set. If the possible maximum set cardinality (size; maximum size = total count of elements we consider) is small, the solution is straightforward: just store the set explicitly (for instance, in form of a RB-tree), update it when necessary and check if the set contains elements that we are interested in. But what if maximum set cardinality is large or we need many such sets to operate simultaneously? Or if the set membership test is an expensive operation?
Suppose we want to know if an element belongs to a set. We have decided that it is acceptable to get false positive answers (the answer is “yes”, but the element is not actually in the set) with probability
p and not acceptable to get false negative (the answer is “no”, but the element in actually in the set). The data structure that could help us in this situation is called the Bloom filter.
A Bloom filter (proposed by Burton Howard Bloom in 1970) is a bit array of
m bits (initially set to 0) and
k different hash functions. Each hash function maps a value into a single integer number.
Look at this picture from Wikipedia:
m = 18 and
k = 3. The filter contains three values:
To add an element to the filter, it must be hashed by all the hash functions. The result of this hashing is
k integer values, which should be mapped to the count of bits in the filter and corresponding bits should be set to 1. Various hash functions and hashing scheme are possible. The main requirement is the uniformity of digest distribution. Cryptographic hash functions such as SHA-1 are good for that, but they are comparably slow. There are good non-cryptographic hash functions such as MurmurHash or Jenkins that are quite fast and well-distributed. For instance, Hadoop uses MurmurHash for 32 bits (or Jenkins, it can be chosen) and the following scheme (see the code here and here):
- Compute the first hash
husing the seed of 0:
h = hash(data, 0).
- Compute hash
h[i]using the previous digest
h[i-1]as a seed:
h[i] = hash(data, h[i-1]).
- Map each hash to a bit number as
pos = Math.abs(h[i] % numberOfBits).
This using of the previous digest as seed is a way to get
k different hash functions from one actual hashing algorithm.
Considering the example from the picture. For
x, the hash functions have generated three (mapped) bit positions (zero based): 2, 6 and 13. For
z positions are 3, 5 and 11. As you can see, there is a collision in position 5.
To test if a value is in a set represented by a Bloom filter, we again do the bit position computation. But in this case, instead of setting the corresponding bits we check if they are set to 1. If all of the bits for the value are positive, the values is probably belongs to the set. If one of the bits are 0, the value is certainly not in the set.
But what should we do with false positives? This is really depends on the application. For example, Apache Cassandra (and probably many others stores) uses the Bloom filter to reduce the disk lookups for non-existent columns and rows: so in such a case if you get a positive testing result you can just load necessary block from the disk and actually see if the key exists. If the testing result is negative it is unnecessary to load the block. In other application you may not have an actual set (for example, you are counting unique visitor per minute on all pages of your web site). In that case, you just give up and accept the possibility of false positive result. Anyway, the accuracy of the Bloom filter is adjustable.
The Bloom filter has two parameters: the number of bits in the filter
m and the number of hash functions
k, but how do we choose them? In Wikipedia and here you can find all the math. Briefly, you can set
p is a probability of false positive,
n is the number of elements.
There are plenty of applications of the Bloom filter. Some of them are mentioned in Wikipedia: Apache Cassandra and Apache HBase use it to reduce disk operations, Google Chrome checks malicious URL clinet-side before sending them to the server, BitCoin uses it to speed up the wallet synchronization. Also, Quora implemented a sharded bloom filter in the feed backend to filter out stories that people have seen before; Facebook uses bloom filters for the typeahead search, to fetch friends and friends of friends to a user typed query (from here) and many more.
Beyond the classic Bloom filter
There are some extensions for the Bloom filter, such as the Scalable Bloom filter that grows dynamically with the size of data, Stable Bloom filter that is well applicable to streaming data, Counting Bloom filter that counts values and also can remove them from a set. You can find a good overview here.
I have implemented a version of the Bloom filter, you can find it on Github, along with many others implementations.