amazon-redshifthyperloglog

Postgresql-hll (or another Hyperloglog data type/structure) for Redshift


Need to be able to report on Unique Visitors, but would like to avoid pre-computing every possible permutation of keys and creating multiple tables.

As a simplistic example, let's say I need to report Monthly Uniques in a table that has the following columns

In Druid and Redis, Hyperloglog data type will take care of this (assuming a small margin of error is acceptable), where I would be able to run a query by any combination of the dimensions and receive a viable estimate of the uniques.

Closest I was able to find in PostgreSQL world is postgresql-hll plugin, but it seems to be for PostgreSQL 9.0+.

Is there a way to represent this in Redshift without either having to pre-compute or store visitor IDs (greatly inflating the table size, but allowing to use RedShift's "approximate count" hll implementation)?

Note: RedShift is the preferred platform, but I already know that other self-hosted PostgreSQL forks can support this, such as CitusDB. Looking for ways to do this with RedShift.


Solution

  • Redshift, while technically postgresql-derived, was forked over ten years ago. It still speaks the same line protocol as postgres, but its code has diverged a great deal. Among other incompatibilities, it no longer allows for custom datatypes. That means that the type of plugin you're looking to use is not going to be feasible.

    However, as you pointed out, if you're able to get all the raw data in, you can use the built-in approximation capability.