I'm looking to clean up event data that happens to have "duplicate" rows for a given day. I want to remove rows for a day that have more than one status based on the context of the next day's status
value. Currently, I am using BigQuery and multiple CTE steps with self joins to iterate through days with multiple events to eventually "true up" every day to have a single status
value.
I have tried using recursive CTEs with self joins, various window functions, etc without much luck. BigQuery doesn't allow analytic functions in recursive CTEs, including GROUP BYs :(
See below for an example of 2 iterations:
# data has multiple instances of days with more than one status (* = duplicate)
| date | status |
|------------|----------|
| 2024-11-01 | active |*
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active |*
| 2024-11-03 | inactive |*
| 2024-11-04 | active |*
| 2024-11-04 | inactive |*
| 2024-11-05 | active |
# first iteration with removed rows (**)
| date | status |
|------------|----------|
| 2024-11-01 | active |** (2024-11-02 is inactive, so remove this row)
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-03 | inactive |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-04 | active |*
| 2024-11-04 | inactive |** (2024-11-05 is active, so remove this row)
| 2024-11-05 | active |
# second iteration with removed rows (***)
| date | status |
|------------|----------|
| 2024-11-01 | active |**
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active |*
| 2024-11-03 | inactive |*** (2024-11-04 has been deduped to active, so remove this row)
| 2024-11-04 | active |*
| 2024-11-04 | inactive |**
| 2024-11-05 | active |
# final desired set of deduplicated rows
| date | status |
|------------|----------|
| 2024-11-01 | inactive |
| 2024-11-02 | inactive |
| 2024-11-03 | active |
| 2024-11-04 | active |
| 2024-11-05 | active |
I can imagine having to iterate N-times given the size of the data. Is there a recursive approach to this problem in SQL? Thanks!
NULL
for dates with multiple statuses.FIRST_VALUE
to find next known status for dates with NULL
status.WITH a AS (
SELECT date, IF(COUNT(DISTINCT status) = 1, MIN(status), NULL) AS status
FROM sample_data
GROUP BY date
),
b AS (
SELECT
date,
COALESCE(
status,
FIRST_VALUE(status IGNORE NULLS) OVER (
ORDER BY date
ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
)
) AS final_status
FROM a
)
SELECT date, final_status AS status
FROM b
ORDER BY date;
Output:
date | status |
---|---|
2024-11-01 | inactive |
2024-11-02 | inactive |
2024-11-03 | active |
2024-11-04 | active |
2024-11-05 | active |