sqlwindow-functionsamazon-athenaprestotrino

Set True to duplicate records that occur within 1 hour window


I’d like to set is_duplicate to TRUE for records that occur within a 1-hour window of an earlier record.

The rule is that each record should check against the most recent prior record where is_duplicate is FALSE, not just the immediately preceding row.

For example, in row 3, the timestamp should be compared with row 1 (the last non duplicate) rather than row 2, which is already marked as a duplicate.

I know that this might need to be done in a post processing job, but is there any way to do this purely with SQL?

I'm running these queries in Athena.

Row data_timestamp is_duplicate explanation
1 2024-09-30 15:55:50 FALSE there's no preceding logging
2 2024-09-30 16:55:50 TRUE diff with row 1 is less than or equal to 1 hour
3 2024-09-30 17:36:50 FALSE diff with row 1 is larger than 1 hour
4 2024-09-30 17:40:50 TRUE diff with row 3 is less than or equal to 1 hour
5 2024-09-30 17:50:03 TRUE diff with row 3 is less than or equal to 1 hour
6 2024-09-30 20:27:24 FALSE diff with row 3 is larger than 1 hour
7 2024-09-30 21:27:24 TRUE diff with row 6 is less than or equal to 1 hour
8 2024-09-30 22:22:24 FALSE diff with row 6 is larger than 1 hour

Using window functions on Athena didn't work since it still doesn't know which row is the most recent prior record while calculating is_duplicate


Solution

  • I’d like to set is_duplicate to TRUE for records that occur within a 1-hour window of an earlier record.

    As far as I understand this can be rephrased as "starting the first record all the following within an 1 hour window should be marked as duplicates, then skip to the next following record", which is what can be "easily" done with the MATCH_RECOGNIZE function (AFAIK Athena should support it) :

    -- sample data
    WITH dataset(Row, data_timestamp, target_is_duplicate) as (
        values (1,  timestamp '2024-09-30 15:55:50', FALSE),
        (2, timestamp '2024-09-30 16:55:50', TRUE),
        (3, timestamp '2024-09-30 17:36:50', FALSE),
        (4, timestamp '2024-09-30 17:40:50', TRUE),
        (5, timestamp '2024-09-30 17:50:03', TRUE),
        (6, timestamp '2024-09-30 20:27:24', FALSE),
        (7, timestamp '2024-09-30 21:27:24', TRUE),
        (8, timestamp '2024-09-30 22:22:24', FALSE)
    )
    
    -- query
    SELECT * FROM dataset MATCH_RECOGNIZE(
         ORDER BY data_timestamp, row
         MEASURES
                  A.row != row AS is_duplicate -- our target value
         ALL ROWS PER MATCH -- take all rows
         AFTER MATCH SKIP PAST LAST ROW -- arguably self explanatory 
         PATTERN (A B*) -- pattern, A is the first row in the group, followed by zero or more records within the "window"
         DEFINE
                  B AS data_timestamp <= FIRST(data_timestamp) + interval '1' hour
         )
    order by row;
    

    Output:

    Row data_timestamp is_duplicate target_is_duplicate
    1 2024-09-30 15:55:50 false false
    2 2024-09-30 16:55:50 true true
    3 2024-09-30 17:36:50 false false
    4 2024-09-30 17:40:50 true true
    5 2024-09-30 17:50:03 true true
    6 2024-09-30 20:27:24 false false
    7 2024-09-30 21:27:24 true true
    8 2024-09-30 22:22:24 false false