sqlrecursiongoogle-bigqueryrecursive-queryrecursive-cte

Recursive CTE to remove duplicates


I'm looking to clean up event data that happens to have "duplicate" rows for a given day. I want to remove rows for a day that have more than one status based on the context of the next day's status value. Currently, I am using BigQuery and multiple CTE steps with self joins to iterate through days with multiple events to eventually "true up" every day to have a single status value.

I have tried using recursive CTEs with self joins, various window functions, etc without much luck. BigQuery doesn't allow analytic functions in recursive CTEs, including GROUP BYs :(

See below for an example of 2 iterations:

# data has multiple instances of days with more than one status (* = duplicate)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |*
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |*
| 2024-11-03 | inactive |*
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |*
| 2024-11-05 | active   |

# first iteration with removed rows (**)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |** (2024-11-02 is inactive, so remove this row)
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-03 | inactive |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |** (2024-11-05 is active, so remove this row)
| 2024-11-05 | active   |

# second iteration with removed rows (***)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |**
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |*
| 2024-11-03 | inactive |*** (2024-11-04 has been deduped to active, so remove this row)
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |**
| 2024-11-05 | active   |

# final desired set of deduplicated rows
| date       | status   |
|------------|----------|
| 2024-11-01 | inactive |
| 2024-11-02 | inactive |
| 2024-11-03 | active   |
| 2024-11-04 | active   |
| 2024-11-05 | active   |

I can imagine having to iterate N-times given the size of the data. Is there a recursive approach to this problem in SQL? Thanks!


Solution

  • WITH a AS (
      SELECT date, IF(COUNT(DISTINCT status) = 1, MIN(status), NULL) AS status
      FROM sample_data
      GROUP BY date
    ),
    b AS (
      SELECT
        date,
        COALESCE(
          status,
          FIRST_VALUE(status IGNORE NULLS) OVER (
            ORDER BY date
            ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
          )
        ) AS final_status
      FROM a
    )
    SELECT date, final_status AS status
    FROM b
    ORDER BY date;
    

    Output:

    date status
    2024-11-01 inactive
    2024-11-02 inactive
    2024-11-03 active
    2024-11-04 active
    2024-11-05 active