sqlgoogle-cloud-platformgoogle-bigqueryquery-optimization

What is the optimal BigQuery SQL query to group the same data multiple times over different groups?


I have a very large BigQuery table with web events that I want to aggregate into daily, weekly and monthly numbers of visitors. And I want to join them into the same table.

My initial guess is to do something like the following query. Which works, but since this is a huge table, and the run time is many hours, I would like to know if this is optimal. The queries are a bit simplified. In practice I group by two more fields and I select a certain set of partitions of the table, because running the entire table at once runs into my project's limits in terms of workers.

WITH base_data AS (
  SELECT
    DATE(event_time) AS date,
    DATE_TRUNC(DATE(event_time), WEEK(MONDAY)) AS weekly_date,
    DATE_TRUNC(DATE(event_time), MONTH) AS monthly_date,
    visitor_id
  FROM
    dataset.table
), daily_data AS (
  SELECT
    date,
    COUNT(DISTINCT visitor_id) AS daily_visitors
  FROM
    base_data
  GROUP BY
    date
), weekly_data AS (
  SELECT
    weekly_date AS date,
    COUNT(DISTINCT visitor_id) AS weekly_visitors
  FROM
    base_data
  GROUP BY
    date
), monthly_data AS (
  SELECT
    monthly_date AS date,
    COUNT(DISTINCT visitor_id) AS monthly_visitors
  FROM
    base_data
  GROUP BY
    date
)

SELECT
  daily_data.*,
  weekly_data.weekly_visitors,
  monthly_data.monthly_visitors
FROM
  daily_data
LEFT JOIN
  weekly_data
ON
  DATE_TRUNC(daily_data.date, WEEK(MONDAY)) = weekly_data.date
LEFT JOIN
  monthly_data
ON
  DATE_TRUNC(daily_data.date, MONTH) = monthly_data.date

Just from the query I would assume that this retrieves the data only once, stores it somehow, do the groupings and join. However, from the Execution graph it seems to scan the same data for every group by statement, which seems inefficient to me.

screenshot of Execution graph

Alternatively, I thought to do the same thing using window functions, which I am always told BigQuery is optimized for. This makes the Execution graph look more linear, and it doesn't split up looking like it scans the dataset three times. But in practice, this takes longer.

WITH base_data AS (
  SELECT
    DATE(event_time) AS date,
    DATE_TRUNC(DATE(event_time), WEEK(MONDAY)) AS weekly_date,
    DATE_TRUNC(DATE(event_time), MONTH) AS monthly_date,
    visitor_id
  FROM
    dataset.table
), step2 AS (
  SELECT
    *,
    COUNT(DISTINCT visitor_id) OVER (PARTITION BY weekly_date) AS weekly_visitors,
    COUNT(DISTINCT visitor_id) OVER (PARTITION BY monthly_date) AS monthly_visitors
  FROM
    base_data
)

SELECT
  date,
  COUNT(DISTINCT visitor_id) AS daily_visitors,
  ANY_VALUE(weekly_visitors) AS weekly_visitors,
  ANY_VALUE(monthly_visitors) AS monthly_visitors,
FROM
  step2
GROUP BY
  date

I want to know whether it is possible to have BigQuery perform this process significantly more optimally, or if this (the first query) is simply roughly the best you can do. Also, I would like to understand the way BigQuery handles these types of queries better.

Any help is greatly appreciated!


Solution

  • You can use GROUPING SETS for this, which will do separate aggregations for each set of grouping columns. The GROUPING function will tell you whether a column is grouped.

    WITH base_data AS (
      SELECT
        DATE_TRUNC(DATE(event_time), DAY) AS daily_date,
        DATE_TRUNC(DATE(event_time), WEEK(MONDAY)) AS weekly_date,
        DATE_TRUNC(DATE(event_time), MONTH) AS monthly_date,
        visitor_id
      FROM
        dataset.table
    )
    SELECT
      CASE
        WHEN GROUPING(daily_date) = 0 THEN 'By date'
        WHEN GROUPING(weekly_date) = 0 THEN 'By week'
        WHEN GROUPING(monthly_date) = 0 THEN 'By month'
      END AS grouping_set,
      COALESCE(daily_date, weekly_date, monthly_date) AS date,
      COUNT(DISTINCT visitor_id) AS visitors
    FROM
      base_data
    GROUP BY GROUPING SETS
      (
        (daily_date),
        (weekly_date),
        (monthly_date)
      );
    

    I'm not sure how efficient BigQuery is at grouping by a casted date, so it may be worth only casting to date at the end.

    WITH base_data AS (
      SELECT
        DATE_TRUNC(event_time, DAY) AS daily_date,
        DATE_TRUNC(event_time, WEEK(MONDAY)) AS weekly_date,
        DATE_TRUNC(event_time, MONTH) AS monthly_date,
        visitor_id
      FROM
        dataset.table
    )
    SELECT
      CASE
        WHEN GROUPING(daily_date) = 0 THEN 'By date'
        WHEN GROUPING(weekly_date) = 0 THEN 'By week'
        WHEN GROUPING(monthly_date) = 0 THEN 'By month'
      END AS grouping_set,
      DATE(COALESCE(daily_date, weekly_date, monthly_date)) AS date,
      COUNT(DISTINCT visitor_id) AS visitors
    FROM
      base_data
    GROUP BY GROUPING SETS
      (
        (daily_date),
        (weekly_date),
        (monthly_date)
      );