sqlgoogle-bigquerytimestampsql-deletebq

Remove old duplicate rows in BQ based on timestamp


I have a BQ table with duplicate (x2 times) rows of the same ad_id.

I want to delete old rows with ts > 120 min where there is a newer one with the same ad_id (Schema contains timestamp, ad_id, value. But there is not rowId).

This is my try, is there a nicer way to do so?

DELETE FROM {table_full_name} o
       WHERE timestamp < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 120 MINUTE) AND timestamp in ( 
                                            SELECT MIN(timestamp) 
                                            FROM  {table_full_name}  i 
                                            WHERE i.ad_id=o.ad_id 
                                            GROUP  BY ad_id) 

Data example:

`ad-id` | `ts`             | `value`   |
`1`     | Sep-1-2021 12:01 | `Scanned` |
`2`     | Sep-1-2021 12:02 | `Error`   |
`1`     | Sep-1-2021 12:03 | `Removed` |

I want to clean it up to be:

`ad-id` | `ts`             | `value`   |
`2`     | Sep-1-2021 12:02 | `Error`   |
`1`     | Sep-1-2021 12:03 | `Removed` |

I saw this post, but BQ doesn't support auto-increment for row-id.

I saw this post. But how can I modify it without the ts interval (as it's unknown).


Solution

  • You can try this script. Used COUNT() with HAVING to pull duplicate records with timestamp older than 120 minutes from current time using TIMESTAMP_DIFF.

    DELETE
    FROM `table_full_name` 
    WHERE ad_id in (SELECT ad_id 
                        FROM `table_full_name`
                        GROUP BY ad_id
                        HAVING COUNT(ad_id) > 1) 
    AND TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), timestamp, MINUTE) > 120
    

    Before:

    enter image description here

    After:

    enter image description here