hyperloglog

HyperLogLog intersection: why not use min?


When doing a union between two compatible HyperLogLog objects, you can just take the maximum bucket to do a lossless union that doesn't introduce any new error:

Union.Bucket[i] = Max(A.Bucket[i], B.Bucket[i])

When doing an intersection though, you have to use the inclusion-exclusion principle:

IntersectionCountEstimate = A.CountEstimate() + B.CountEstimate() - Union.CountEstimate()

Why is it that using the minimum bucket value doesn't work as an effective intersection?

Intersection.Bucket[i] = Min(A.Bucket[i], B.Bucket[i])

Solution

  • The cause is that the relationship between two instances of the HyperLogLog statistic is not very intuitive:

    Consider two corresponding buckets A[i] and B[i] from separate HyperLogLog structures A and B (which have the same number of buckets and use the same hash function), and for simplicity's sake assume the data in A and in B are independently drawn from the same distribution. Let's assume we first draw all the elements for A, and only then draw elements for B.

    For every element we observe reaching B[i], what is the probability that is it in the intersection of A and B, i.e. what is the probability that it is already in A[i]? Well that depends - how "full" is A[i]? If A[i] is completely "full" (i.e., A[i] "contains" ALL the elements from the distribution which can reach A[i]), then the probability is 1. In that case, the cardinality of the intersection of A[i] and B[i] would indeed be the cardinality of B[i]. However, it is almost NEVER the case that A[i] is "full" - so the intersection is MUCH SMALLER than the cardinality of B[i].