sqlgoogle-bigquery

How to find MIN() of sorted aggregates without using loops


I have a table that my daily script sends the findings to. It looks something like this:

product issues_found run_date
Alpha 2 2024-8-12
Alpha 5 2024-8-11
Alpha 3 2024-8-10
Alpha 0 2024-8-9
Alpha 1 2024-8-8
Alpha 5 2024-8-7
Beta 4 2024-8-12
Beta 3 2024-8-11

I want to find out the date when the current set of issues started for every product (i.e., the latest date when issues_found turned from 0 to non-zero), and add it to another column. So the table should become something like:

product issues_found run_date issues_started
Alpha 2 2024-8-12 2024-8-10
Alpha 5 2024-8-11 2024-8-10
Alpha 3 2024-8-10 2024-8-10
Alpha 0 2024-8-9 2024-8-10
Alpha 1 2024-8-8 2024-8-10
Alpha 5 2024-8-7 2024-8-10
Beta 4 2024-8-12 2024-8-11
Beta 3 2024-8-11 2024-8-11

My approach is to iterate over date-sorted records for every product and break when 0 is encountered:

DECLARE found_date DATE;
FOR row in (SELECT * FROM my_table)
DO
  SET found_date = NULL;
  FOR historical_entry in (SELECT * FROM my_table WHERE product = row.product ORDER BY run_date DESC)
  DO
    IF historical_entry.issues_found <> 0 THEN
      SET found_date = historical_entry.run_date;
    ELSE
      BREAK;
    END IF;
  END FOR;
  UPDATE my_table SET issues_started = found_date where product = row.product;
END FOR;

It's a very procedural approach and although it works, it takes hours when run against hundreds of products and thousands of records. Is there a better approach than two nested loops? I tried using MIN(run_date) over aggregates of non-zero issues_found rows but couldn't quite get it right.


Solution

  • Consider below approach

    WITH issue_starts AS (
        SELECT
            product,
            run_date,
            issues_found,
            CASE
                WHEN issues_found <> 0 AND IFNULL(LAG(issues_found) OVER (PARTITION BY product ORDER BY run_date), 0) = 0 
                THEN run_date
            END AS start_of_issues
        FROM your_table
    ), latest_issue_start AS (
        SELECT
            product,
            run_date,
            issues_found,
            MAX(start_of_issues) OVER (PARTITION BY product ORDER BY run_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS issues_started
        FROM issue_starts
    )
    SELECT
        t.product,
        t.issues_found,
        t.run_date,
        l.issues_started
    FROM your_table t
    JOIN latest_issue_start l
    ON t.product = l.product AND t.run_date = l.run_date
    ORDER BY t.product, t.run_date DESC
    

    if applied to sample data in your question - output is

    enter image description here