The HyperLogLog algorithm by Flajolet et al describes a clever way to estimate the cardinality of a set using only a tiny amount of memory. However, it does take into account all N elements of the original set in the calculation. What if we had access to only a small random sample (say, 10%) of the original N? Has there been any research on how HyperLogLog or similar algorithms can be adapted to this situation?
I am aware that this is essentially the problem described as distinct value estimation, for which abundant research exists (see for example this paper for an overview). However, the research on the distinct value estimation that I'm aware of uses a number of ad-hoc estimators very different from the approach used by HyperLogLog. Therefore, I am wondering if someone has already thought of adapting HyperLogLog to the distinct value estimation problem.
However, the research on the distinct value estimation that I'm aware of uses a number of ad-hoc estimators very different from the approach used by HyperLogLog.
Yes, because they are solving a very different problem.
Suppose you just confiscated a stash of 1.000.000 counterfeit dollar bills, and you want to know the number of distinct serial numbers.
Sampling 100.000 of them (using HyperLogLog, as your antique steam-driven counting machine has only 1k memory) you count 5000 different serial numbers, each of which occurs somewhere around 20 times. Then you can be pretty sure that the whole stash will contain only a little over 5000 distinct serial numbers.
Now suppose that 1 serial number occurs 95.001 times, and 4999 serial numbers occur only once. Apparently some bona fide bank notes found their way into your stash. Now you can be pretty confident that the stash contains around 5% honest banknotes, so that the entire stash contains around 50.000 distinct serial numbers
Note that the distribution of the frequencies in your sample is used to infer something about the distribution in the entire stash. This is actually mentioned as one of the "ad hoc" (your words) methods in the second paper you cite ("Sampling-based estimation of the number of distinct values(..)"):
The idea behind a parametric estimator is to fit a probability distribution to the observed relative frequencies of the different attribute values.
Also note that the results of HyperLogLog and similar methods are completely insensitive to the distribution of the samples over their values. But your final estimate evidently depends very much on it!
My advice: use method of your choice (like HyperLogLog) to count the number of distinct values in your sample, and then use one of the methods in "Sampling-based estimation" to estimate the number of values in your entire multiset , or use your prior knowledge abut the distribution of the multiset to calculate an estimate (maybe you saw the counterfeiters' printing press, and you know it could only ever print one serial number)