postgresqltableau-apidistinct-valueshyperloglog

Execute extract on Tableau for distinct count using HLL


I have a somewhat huge table (130 million rows), that I am able to crunch on the same server in under 10 minutes, and produce a slimmed-down, pre-aggregated table, that works just fine and everyone is happy to use it.

The table is grouped by around 6 columns, and the rest are aggregated with SUM() or a variety of COUNT(DISTINCT myColumn) that's optimized with HLL.

hll_union_agg(hll_add(hll_empty(15, 4), hll_hash_bigint(my_column)))

People who use this small resulting table, can group by just 1 or 2 of the original columns, to get even a broader summary. Only at the final step the actual count is produced with

hll_cardinality(hll_union_agg(my_intermediary_hll_set))

Results are produced in milliseconds.

The problem is moving this to Tableau as an extract in Hyper, because people want to use it there. Tableau doesn't know about HLL, and it wants to do the distinct-count itself. However pushing all of that raw data is extremely slow (like 5 hours slow).

I am willing to compromise on speed, but not by going from 10 minutes to 5 hours. Is there a way to transfer the set, that is currently represented as HLL, as a whole column? Like an array or JSON or something?

I found a similar question here, but it's more about Cassandra than about Tableau extracts.

Finding the distinct count from a cassandra column using Tableau

To sum it up:
How do I pass in pre-aggregated rows to Tableau, and still be able to produce valid COUNT(DISTINCT x) calculations on Tableau side?


Solution

  • I managed to solve this issue using pass-through SQL functions in Tableau. After adding the data source (LIVE only) that contains HLL columns, creating a custom computed column using an expression like this:

    rawsqlagg_int("hll_cardinality(hll_union_agg(%1))", [Set of entities])
    

    I got the result that I needed. Amazing and fast.