sqlgoogle-bigqueryhyperloglog

BigQuery: How to merge HLL Sketches over a window function? (Count distinct values over a rolling window)


Example relevant table schema:

+---------------------------+-------------------+
| activity_date - TIMESTAMP | user_id - STRING  |
+---------------------------+-------------------+
| 2017-02-22 17:36:08 UTC   | fake_id_i24385787 |
+---------------------------+-------------------+
| 2017-02-22 04:27:08 UTC   | fake_id_234885747 |
+---------------------------+-------------------+
| 2017-02-22 08:36:08 UTC   | fake_id_i24385787 |
+---------------------------+-------------------+

I need to count active distinct users over a large data set over a rolling time period (90 days), and am running into issues due to the size of the dataset.

At first, I attempted to use a window function, similar to the answer here. https://stackoverflow.com/a/27574474

WITH
  daily AS (
  SELECT
    DATE(activity_date) day,
    user_id
  FROM
    `fake-table`)
SELECT
  day,
  SUM(APPROX_COUNT_DISTINCT(user_id)) OVER (ORDER BY day ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) ninty_day_window_apprx
FROM
  daily
GROUP BY
  1
ORDER BY
  1 DESC

However, this resulted in getting the distinct number of users per day, then summing these up - but distincts could be duplicated within the window, if they appeared multiple times. So this is not a true accurate measure of distinct users over 90 days.

The next thing I tried is to use the following solution https://stackoverflow.com/a/47659590 - concatenating all the distinct user_ids for each window to an array and then counting the distincts within this.

WITH daily AS (
  SELECT date(activity_date) day, STRING_AGG(DISTINCT user_id) users
  FROM `fake-table`  
  GROUP BY day
), temp2 AS (
  SELECT
    day, 
    STRING_AGG(users) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) users
  FROM daily
)

SELECT day, 
  (SELECT APPROX_COUNT_DISTINCT(id) FROM UNNEST(SPLIT(users)) AS id) Unique90Days
FROM temp2

order by 1 desc

However this quickly ran out of memory with anything large.

Next was to use a HLL sketch to represent the distinct IDs in a much smaller value, so memory would be less of an issue. I thought my problems were solved, but I'm getting an error when running the following: The error is simply "Function MERGE_PARTIAL is not supported." I tried with MERGE as well and got the same error. It only happens when using the window function. Creating the sketches for each day's value works fine.

I read through the BigQuery Standard SQL documentation and don't see anything about HLL_COUNT.MERGE_PARTIAL and HLL_COUNT.MERGE with window functions. Presumably this should take the 90 sketches and combine them into one HLL sketch, representing the distinct values between the 90 original sketches?

WITH
  daily AS (
  SELECT
    DATE(activity_date) day,
    HLL_COUNT.INIT(user_id) sketch
  FROM
    `fake-table`
  GROUP BY
    1
  ORDER BY
    1 DESC),

  rolling AS (
  SELECT
    day,
    HLL_COUNT.MERGE_PARTIAL(sketch) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch
    FROM daily)

SELECT
  day,
  HLL_COUNT.EXTRACT(rolling_sketch)
FROM
  rolling
ORDER BY
  1 

"Image of the error - Function MERGE_PARTIAL is not supported"

Any ideas why this error happens or how to adjust?


Solution

  • Combine HLL_COUNT.INIT and HLL_COUNT.MERGE. This solution uses a 90 days cross join with GENERATE_ARRAY(1, 90) instead of OVER.

    #standardSQL
    SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
     , HLL_COUNT.MERGE(sketch) unique_90_day_users
     , HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
     , HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
    FROM (
      SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
      FROM `bigquery-public-data.stackoverflow.posts_questions` 
      WHERE EXTRACT(YEAR FROM creation_date)=2017
      GROUP BY 1
    ), UNNEST(GENERATE_ARRAY(1, 90)) i
    GROUP BY 1
    ORDER BY date_grp
    

    enter image description here