databaseredishyperloglog

What is hyperloglog and why is this good for?


I was studying data structures supported by Redis and I was not able to find out an explanation that could make me understand what HyperLogLog is.

How do I use it and why is this good for?


Solution

  • Basically is a kind of Redis Set which uses optimized algorithms to count elements by avoiding a great consumption of memory. The difference between a Set and a HyperLogLog is that with a HyperLogLog you just can add, count unique element and merge some HyperLogLogs in another one, so basically you don't store the members in a HyperLogLog as you could do in a SET, and retrieve them, you just store the occurrences of different members, that is the reason which HyperLogLog doesn't provide a command to retrieve its stored members.

    A clear uses case could be if you want to have a huge SET where you want to count so many times the number of unique data inside the set, you are not interested in which data are inside the set, you are only interested in consuming low memory even when the set grows a lot. For instance, imagine you have a high impact system with a large number of users all of them very active, and you are interested in knowing the number of unique visitors in every webpage of your system. You want to be updated real-time, so you will query every second the unique visitors for every website. You could create a HyperLogLog for every URI in your system, which will represent the webpage and every time a user visits a URL you will PFAAD the user_id:

    PFAAD /api/show/concerts id789989

    then every second you will iterate for every URL-HyperLogLog to get number of unique user-visitors

    PFCOUNT /api/show/concerts

    145542

    PFCOUNT /api/show/open-airs

    25565223

    And you would say, yes but I can get the same functionality by using SET with the benefit of having the user_ids in every set as members. Yes, you can, but you will consume much memory by using sets and every time (second) you query every set to get the numbers of unique visitors with SCARD command, you will spend even much memory, so at least you need to store user_ids for some reason, HyperLogLogs are better options as counters of unique elements. For our use case, imagine having 200-300 sets with around 20-30k of users inside.

    The correspondence between HyperLogLog and Set commands: