I have a very large BigQuery table with web events that I want to aggregate into daily, weekly and monthly numbers of visitors. And I want to join them into the same table.
My initial guess is to do something like the following query. Which works, but since this is a huge table, and the run time is many hours, I would like to know if this is optimal. The queries are a bit simplified. In practice I group by two more fields and I select a certain set of partitions of the table, because running the entire table at once runs into my project's limits in terms of workers.
WITH base_data AS (
SELECT
DATE(event_time) AS date,
DATE_TRUNC(DATE(event_time), WEEK(MONDAY)) AS weekly_date,
DATE_TRUNC(DATE(event_time), MONTH) AS monthly_date,
visitor_id
FROM
dataset.table
), daily_data AS (
SELECT
date,
COUNT(DISTINCT visitor_id) AS daily_visitors
FROM
base_data
GROUP BY
date
), weekly_data AS (
SELECT
weekly_date AS date,
COUNT(DISTINCT visitor_id) AS weekly_visitors
FROM
base_data
GROUP BY
date
), monthly_data AS (
SELECT
monthly_date AS date,
COUNT(DISTINCT visitor_id) AS monthly_visitors
FROM
base_data
GROUP BY
date
)
SELECT
daily_data.*,
weekly_data.weekly_visitors,
monthly_data.monthly_visitors
FROM
daily_data
LEFT JOIN
weekly_data
ON
DATE_TRUNC(daily_data.date, WEEK(MONDAY)) = weekly_data.date
LEFT JOIN
monthly_data
ON
DATE_TRUNC(daily_data.date, MONTH) = monthly_data.date
Just from the query I would assume that this retrieves the data only once, stores it somehow, do the groupings and join. However, from the Execution graph it seems to scan the same data for every group by statement, which seems inefficient to me.
Alternatively, I thought to do the same thing using window functions, which I am always told BigQuery is optimized for. This makes the Execution graph look more linear, and it doesn't split up looking like it scans the dataset three times. But in practice, this takes longer.
WITH base_data AS (
SELECT
DATE(event_time) AS date,
DATE_TRUNC(DATE(event_time), WEEK(MONDAY)) AS weekly_date,
DATE_TRUNC(DATE(event_time), MONTH) AS monthly_date,
visitor_id
FROM
dataset.table
), step2 AS (
SELECT
*,
COUNT(DISTINCT visitor_id) OVER (PARTITION BY weekly_date) AS weekly_visitors,
COUNT(DISTINCT visitor_id) OVER (PARTITION BY monthly_date) AS monthly_visitors
FROM
base_data
)
SELECT
date,
COUNT(DISTINCT visitor_id) AS daily_visitors,
ANY_VALUE(weekly_visitors) AS weekly_visitors,
ANY_VALUE(monthly_visitors) AS monthly_visitors,
FROM
step2
GROUP BY
date
I want to know whether it is possible to have BigQuery perform this process significantly more optimally, or if this (the first query) is simply roughly the best you can do. Also, I would like to understand the way BigQuery handles these types of queries better.
Any help is greatly appreciated!
You can use GROUPING SETS for this, which will do separate aggregations for each set of grouping columns. The GROUPING function will tell you whether a column is grouped.
WITH base_data AS (
SELECT
DATE_TRUNC(DATE(event_time), DAY) AS daily_date,
DATE_TRUNC(DATE(event_time), WEEK(MONDAY)) AS weekly_date,
DATE_TRUNC(DATE(event_time), MONTH) AS monthly_date,
visitor_id
FROM
dataset.table
)
SELECT
CASE
WHEN GROUPING(daily_date) = 0 THEN 'By date'
WHEN GROUPING(weekly_date) = 0 THEN 'By week'
WHEN GROUPING(monthly_date) = 0 THEN 'By month'
END AS grouping_set,
COALESCE(daily_date, weekly_date, monthly_date) AS date,
COUNT(DISTINCT visitor_id) AS visitors
FROM
base_data
GROUP BY GROUPING SETS
(
(daily_date),
(weekly_date),
(monthly_date)
);
I'm not sure how efficient BigQuery is at grouping by a casted date, so it may be worth only casting to date at the end.
WITH base_data AS (
SELECT
DATE_TRUNC(event_time, DAY) AS daily_date,
DATE_TRUNC(event_time, WEEK(MONDAY)) AS weekly_date,
DATE_TRUNC(event_time, MONTH) AS monthly_date,
visitor_id
FROM
dataset.table
)
SELECT
CASE
WHEN GROUPING(daily_date) = 0 THEN 'By date'
WHEN GROUPING(weekly_date) = 0 THEN 'By week'
WHEN GROUPING(monthly_date) = 0 THEN 'By month'
END AS grouping_set,
DATE(COALESCE(daily_date, weekly_date, monthly_date)) AS date,
COUNT(DISTINCT visitor_id) AS visitors
FROM
base_data
GROUP BY GROUPING SETS
(
(daily_date),
(weekly_date),
(monthly_date)
);