data-structuresxorbloom-filter

What is an XOR filter?


There's a relatively new data structure (2020) called the XOR filter that's being used as a replacement for a Bloom filter.

What is an XOR filter? What advantages does it offer over the Bloom filter? And how does it work?


Solution

  • An XOR filter is designed as a drop-in replacement for a Bloom filter in the case where all the items to store in the filter are known in advance. Like the Bloom filter, it represents an approximation of a set where false negatives are not allowed, but false positives are.

    Like a Bloom filter, an XOR filter stores a large array of bits. Unlike a Bloom filter, though, where we think of each bit as being its own array slot, in an XOR filter the bits are grouped together into L-bit sequences, for some parameter L we'll pick later. For example, an XOR filter might look like this:

     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
     | 11011 | 10010 | 11101 | 11100 | 01001 | 10101 | 01011 | 11001 | 11011 |
     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
    

    Next, we pick three hash functions h1, h2, and h3 that, like a Bloom filter, hash items to slots in the array. Those hash functions let us take an item x and compute its table code, which we do by XORing together the items in spots h1(x), h2(x), and h3(x). An example of this is shown here:

     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
     | 11011 | 10010 | 11101 | 11100 | 01001 | 10101 | 01011 | 11001 | 11011 |
     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
                 ^                       ^       ^
                 |                       |       |
                h3(x)                   h1(x)   h2(x)
                  
                Table code for x:   10010 xor 01001 xor 10101
                                  = 01110
    

    To complete the picture, we need one more hash function called the fingerprinting function, denoted f(x). The fingerprinting function takes a value as input and outputs an L-bit number called the fingerprint of x. To see whether x is stored in the table, we check whether the table code for x matches the fingerprint for x. If so, we say that x is (probably) in the table. If not, we say that x is (definitely) not in the table.

    It's helpful to compare this idea against a Bloom filter. With a Bloom filter, we hash x to a number of positions, then derive a value from those positions by AND-ing everything together, and finally check if the value we got was equal to 1. With an XOR filter, we hash x to three positions, derive a value from those positions by XOR-ing them together, and finally check if the value we got is equal to f(x).

    To change the false positive rate for the XOR filter, we simply change the value of L. Specifically, the likelihood that f(x) coincidentally happens to match the XOR of the three locations given by h1(x), h2(x), and h3(x) is 2-L, since that's the probability that a random L-bit value matches another. Therefore, to get a false positive rate of ε, we simply set L = log2 ε-1.

    The challenging part is filling in the table. It turns out that there's a really simple strategy for doing so. To store a list of n elements, create a table of size 1.23n. Then, use this recursive procedure:

    1. If there are no items left to place, you're done.
    2. Pick an item x with the following property: x hashes to a table slot (say, slot k) that no other items hash to.
    3. Remove x from the list of items to place and recursively place the remaining items.
    4. Set the value of slot k to a number such that the XOR of the table slots x hashes to equals f(x). (This is always possible: simply XOR together the contents of the other two table slots and f(x), then store that in slot k.)

    This procedure has a slight chance of getting stuck in step (2) if every table slot has at least two items hashing to it, but it can be shown that as long as you use at least 1.23n table slots the probability that this occurs is extremely small. If that happens, simply choose new hash functions and try again.

    XOR filters have several advantages over regular Bloom filters.

    XOR filters have one main disadvantage relative to Bloom filters, and that's that all the items to store in the XOR filter must be known in advance before the filter is constructed. This contrasts with Bloom filters, where items can be added incrementally over a long period of time. Aside from this, though, the XOR filter offers better performance and memory usage.

    For more information about XOR filters, along with how they compare in terms of time and space to Bloom filters and cuckoo filters, check out this set of lecture slides, which explains how they work, along with where the 1.23 constant comes from and why we always use three hash functions.