Tags: python, database, hash, bloom-filter, murmurhash

Bloom Filters with the Kirsch-Mitzenmacher optimization


I have recently started playing around with Bloom Filters, and I have a use case for which this calculator suggests using 15 different, unrelated hash functions. Hashing a string 15 times would be quite compute-intensive, so I started looking for a better solution.

Thankfully, I came across the Kirsch-Mitzenmacher optimization, which suggests the following: hash_i = hash1 + i * hash2, with 0 ≤ i ≤ k - 1, where k is the number of hash functions needed.

In my specific case, I am trying to create a 0.5 MiB Bloom Filter that should eventually be able to efficiently store 200,000 (200k) pieces of information. With 15 hash functions, this would result in p = 0.000042214, or 1 in 23,689 (23k).
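(For reference, these numbers match the standard Bloom Filter approximations, k ≈ (m/n) * ln 2 and p ≈ (1 - e^(-kn/m))^k; here is a quick sanity check I did, assuming the calculator uses those formulas:)

import math

m = 4194304  # 0.5 MiB expressed in bits
n = 200000   # expected number of stored items
k = 15       # number of hash functions

k_opt = (m / n) * math.log(2)        # ~14.5, so 15 hash functions
p = (1 - math.exp(-k * n / m)) ** k  # ~4.22e-05, i.e. roughly 1 in 23,689

print(round(k_opt, 1), p, round(1 / p))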

I tried implementing the optimization in Python as follows:

import uuid
import hashlib
import mmh3

m = 4194304 # 0.5 MiB in bits
k = 15

bit_vector = [0] * m

data = str(uuid.uuid4()).encode()

# two independent base hashes: SHA-256 (as an integer) and 32-bit MurmurHash3
hash1 = int(hashlib.sha256(data).hexdigest(), base=16)
hash2 = mmh3.hash(data)

# my attempt at hash_i = hash1 + i * hash2
for _ in range(0, k-1):
    hash1 += hash2

bit = hash1 % m

bit_vector[bit] = 1

But it seems to give far worse results than a classic Bloom Filter built with only two hashes (it performs 10x worse in terms of false positives).

What am I doing wrong here? Am I completely misinterpreting the optimization, and hence using the formula in the wrong way, or am I just using the wrong "parts" of the hashes (here I am using the full hashes in the calculation; maybe I should use the first 32 bits of one and the last 32 bits of the other)? Or maybe the problem is that I encode the data before hashing it?

I also found this Git repository that discusses Bloom Filters and the optimization, and their results when using it are quite good, both in terms of computing time and in the number of false positives.

In the example I am using Python, as that's the programming language I'm most comfortable with for testing things out, but in my project I am actually going to use JavaScript to perform the whole process. Help in any programming language of your choice is much appreciated regardless.


Solution

  • The Kirsch-Mitzenmacher optimization comes from a proof-of-concept paper, and as such it assumes a number of requirements on the table size and on the hash functions themselves. Using it naively has tripped up people before you, too. There is a bunch of practical stuff to consider. Check out section "6.5.1 Double hashing" of P. C. Dillinger's thesis, linked in his long comment as well, in which he explains the issues and offers solutions. His solutions to the rocksdb implementation issues can also be interesting.

    Maybe start by trying the "enhanced double hashing":

    # enhanced double hashing: hashes[i] == hash1 + i*hash2 + (i**3 - i)//6
    hashes = []
    for i in range(k):        # k hash values -> k bits to set per item
        hash2 += i            # i = 0 leaves hash2 unchanged on the first pass
        hashes.append(hash1)
        hash1 += hash2
    

    (I don't know why your example code only produces a single hash as its result; I assume you pasted simplified code. 15 hash functions should set 15 bits.)
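    To make this concrete, here is a minimal, untested sketch of how it could all fit together with your two base hashes (the _indices/add/might_contain names and the plain-list bit vector are just my choices for illustration, not a reference implementation):

    import hashlib
    import mmh3

    m = 4194304   # filter size in bits (0.5 MiB)
    k = 15        # bits set/checked per item

    bit_vector = [0] * m

    def _indices(data):
        # two independent base hashes, combined via enhanced double hashing
        h1 = int(hashlib.sha256(data).hexdigest(), base=16)
        h2 = mmh3.hash(data)
        out = []
        for i in range(k):
            h2 += i
            out.append(h1 % m)
            h1 += h2
        return out

    def add(data):
        for idx in _indices(data):
            bit_vector[idx] = 1

    def might_contain(data):
        # True means "probably present", False means "definitely absent"
        return all(bit_vector[idx] for idx in _indices(data))

    add(b"some key")
    assert might_contain(b"some key")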

    If this doesn't work, and you are tired of reading and trying more solutions, it is fair to say "it's not worth it" at this point and look for a library with a good implementation instead, comparing candidates for speed and accuracy. Unfortunately, I don't know of any to recommend. Maybe the rocksdb implementation, since we know an expert has worked on it?

    And even that may not be worth the effort; calling mmh3.hash(data, i) with 15 different seeds may already be reasonably fast.
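    That fallback is a one-liner, since mmh3.hash accepts a seed as its second argument. A sketch reusing the names from above, worth benchmarking against the double-hashing variants before committing to either:

    indices = [mmh3.hash(data, seed) % m for seed in range(k)]
    for idx in indices:
        bit_vector[idx] = 1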