data-structuresprobabilitycardinalityhyperloglog

Merge uniq counters, probabilistic data structures


There are two sets 1 2 3 and 3 4 with 3 and 2 unique items.

Now let's calculate unique items in merged set. If we just sum up the counters 3 + 2 = 5 it will be wrong (it should be uniq(1 2 3 3 4) = 4).

Is there a way to do it using only the counters? For each counter it's ok to use some extra constant memory data structure representing the original set, small errors are also ok, let's say 95% accuracy is ok.

I know there are probabilistic unique counters using very little memory (HyperLogLog). But is there a way to merge two such probabilistic counters?


Solution

  • Yes, HyperLogLog actually allows merging quite naturally, and most implementations include merging. In a nutshell, to merge two HyperLogLog structures A and B into a new one C, take the maximum of each bucket pair C[i] = max(A[i],B[i]).