Tags: algorithm, bloom-filter

How do Bloom filter implementations stay clean?


Since Bloom filters fill up and their false-positive rate increases, what techniques are used to keep them from saturating? It seems like you cannot clear bits, since that would create an immediate false negative for other data stored in that node.

Even if you have a set of known size, in a data store that uses Bloom filters, such as Cassandra, the data in a node is going to be added and removed, right? But when you remove a key you cannot set its Bloom filter buckets to 0, since that might create a false negative for other data in the node that hashes to one or more of the same buckets as the removed key. So over time, it is as if the filter fills up.
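The hazard described above can be demonstrated with a minimal sketch of a Bloom filter. All names here (`BloomFilter`, `remove_naive`, the SHA-256-based hashing) are illustrative assumptions, not taken from Cassandra or any library:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch (illustrative, not production code)."""

    def __init__(self, size=32, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def contains(self, key):
        # May return a false positive, but never a false negative --
        # as long as bits are never cleared.
        return all(self.bits[pos] for pos in self._positions(key))

    def remove_naive(self, key):
        # WRONG: clearing this key's bits can also erase evidence of
        # any other key that happens to hash to the same positions.
        for pos in self._positions(key):
            self.bits[pos] = 0

bf = BloomFilter()
bf.add("alpha")
bf.add("beta")
bf.remove_naive("alpha")
# If "alpha" and "beta" share any bit position, "beta" may now be a
# false negative -- exactly the problem described in the question.
```

A standard way around this, for use cases that genuinely need deletion, is a counting Bloom filter: each position holds a small counter instead of a bit, so removal decrements rather than clears.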


Solution

  • I think you need to set an upper bound on the size of the set that the Bloom filter covers. If the set exceeds that size, you need to rebuild the filter from the remaining keys.

    As used in Cassandra, the size of the set covered by the Bloom filter is known before the filter is created, so this is not an issue.

    Another approach is Scalable Bloom Filters, which grow by chaining additional filters as the set expands instead of being rebuilt.
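A rough sketch of the Scalable Bloom Filter idea: keep a list of fixed-capacity filters; when the active one fills, chain a larger one with a tighter error rate, and answer membership queries by checking every filter in the chain. The class and parameter names below are made up for illustration, and the sizing uses the standard formulas m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2:

```python
import hashlib
import math

class PlainBloom:
    """Fixed-capacity Bloom filter, sized for a target error rate."""

    def __init__(self, capacity, error_rate):
        self.capacity = capacity
        self.error_rate = error_rate
        # m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hash functions
        self.size = max(1, int(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = [0] * self.size
        self.count = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1
        self.count += 1

    def contains(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

class ScalableBloom:
    """Chains PlainBloom filters so capacity can grow without a rebuild."""

    def __init__(self, initial_capacity=100, error_rate=0.01,
                 growth=2, tightening=0.5):
        self.growth = growth
        self.tightening = tightening
        self.filters = [PlainBloom(initial_capacity, error_rate)]

    def add(self, key):
        active = self.filters[-1]
        if active.count >= active.capacity:
            # Active filter is full: chain a larger filter with a
            # tighter error rate so the compound error stays bounded.
            active = PlainBloom(active.capacity * self.growth,
                                active.error_rate * self.tightening)
            self.filters.append(active)
        active.add(key)

    def contains(self, key):
        # A key may live in any of the chained filters.
        return any(f.contains(key) for f in self.filters)

sbf = ScalableBloom(initial_capacity=2, error_rate=0.01)
for word in ["a", "b", "c", "d", "e"]:
    sbf.add(word)
```

The tightening factor matters: each new filter's error rate shrinks geometrically, so the total false-positive probability across the whole chain converges to a bound rather than growing without limit.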