algorithmdata-structuresestimationhyperloglog

Cardinality approximation for logical set operations – (The "HyperLogLog" for AND/OR/XOR)


we are currently facing an interesting problem. We would like to estimate the cardinality of a set without the need to store every single item (typically bitmaps/bitsets are a nice approach). A very nice algorithm is the so called HyperLogLog randomized algorithm (see more here http://antirez.com/news/75).

The problem here is, that you can only merge sets as UNIONs, so basically it's a OR combination.

We actually want not only to combine sets with OR, but as well with AND. We even want to combine these operations.

Example: set1 AND (set2 OR set3) OR (set4 AND set5)

Each set may have a cardinality in the range of millions. Each value has a size of 128 bit.

Each set can be represented in any way e.g. "HLL, bloom filter, a plain list, or a combination of these". The algorithm must execute in the shortest possible amount of time using a feasible amount of space.

Any ideas?


Solution

  • This exact problem is the subject of https://pdfs.semanticscholar.org/5da8/bf81712187712aed159aed62e38fb012872e.pdf. Their recommendation is to use bloom filters.

    The bloom filter for the union is the bitwise OR of the bloom filters. The bloom filter for intersection is the bitwise AND of the bloom filters. So you can easily generate the bloom filter of the operation that you want.

    Their theorem 1 allows you to estimate the size of a set from how many bits are set in its bloom filter.