perlredishyperloglog

redis HLL too many false positives


Hyperlog log is a probablistic algorithm According to the redis HLL document , we could get 0.81% of error but I get errors like 17-20%

I think there is something wrong .. This is my simple perl test script. Is there some error

#!/usr/bin/perl -w                                                                                                                                                       
use Redis;
my $redis = Redis->new(server=>'192.168.50.166:6379') or die;
my $fp=0;
my $HLL="HLL";

$redis->del($HLL);
foreach my $i (1..10000) {
  my $s1 = $redis->pfadd($HLL,$i);
  if($s1 == 0){ 
    print "False positive on $i\n";
    $fp++;
  }
}
print "count of false positives $fp\n";

Solution

  • HyperLogLog is used for counting unique items. It can count a large number of items with a little memory. However, the returned cardinality is NOT exact, but approximated with a standard error.

    0.81% is the standard error, NOT the false positive. For your instance, you can call PFCOUNT HLL to get the approximated number of unique items you put into the HyperLogLog. The returned number should be in range of [10000 * (1 - 0.81%), 10000 * (1 + 0.81%)].

    PFADD returns 1 if the estimated cardinality is changed after executing the command. It returns 0, otherwise. It has nothing to do with false positive.

    It seems what you need is a Bloom Filter, which can tell you if an item already exists in a data set, with false positive. You can implement a Bloom Filter with Redis, of course. And there should be some open source project for that.