sql window-functions amazon-athena presto trino

Set True to duplicate records that occur within 1 hour window

I’d like to set is_duplicate to TRUE for records that occur within a 1-hour window of an earlier record.

The rule is that each record should check against the most recent prior record where is_duplicate is FALSE, not just the immediately preceding row.

For example, in row 3, the timestamp should be compared with row 1 (the last non duplicate) rather than row 2, which is already marked as a duplicate.

I know that this might need to be done in a post processing job, but is there any way to do this purely with SQL?

I'm running these queries in Athena.

Row	data_timestamp	is_duplicate	explanation
1	2024-09-30 15:55:50	FALSE	there's no preceding logging
2	2024-09-30 16:55:50	TRUE	diff with row 1 is less than or equal to 1 hour
3	2024-09-30 17:36:50	FALSE	diff with row 1 is larger than 1 hour
4	2024-09-30 17:40:50	TRUE	diff with row 3 is less than or equal to 1 hour
5	2024-09-30 17:50:03	TRUE	diff with row 3 is less than or equal to 1 hour
6	2024-09-30 20:27:24	FALSE	diff with row 3 is larger than 1 hour
7	2024-09-30 21:27:24	TRUE	diff with row 6 is less than or equal to 1 hour
8	2024-09-30 22:22:24	FALSE	diff with row 6 is larger than 1 hour

Using window functions on Athena didn't work since it still doesn't know which row is the most recent prior record while calculating is_duplicate

Solution

I’d like to set is_duplicate to TRUE for records that occur within a 1-hour window of an earlier record.

As far as I understand this can be rephrased as "starting the first record all the following within an 1 hour window should be marked as duplicates, then skip to the next following record", which is what can be "easily" done with the MATCH_RECOGNIZE function (AFAIK Athena should support it) :

-- sample data
WITH dataset(Row, data_timestamp, target_is_duplicate) as (
    values (1,  timestamp '2024-09-30 15:55:50', FALSE),
    (2, timestamp '2024-09-30 16:55:50', TRUE),
    (3, timestamp '2024-09-30 17:36:50', FALSE),
    (4, timestamp '2024-09-30 17:40:50', TRUE),
    (5, timestamp '2024-09-30 17:50:03', TRUE),
    (6, timestamp '2024-09-30 20:27:24', FALSE),
    (7, timestamp '2024-09-30 21:27:24', TRUE),
    (8, timestamp '2024-09-30 22:22:24', FALSE)
)

-- query
SELECT * FROM dataset MATCH_RECOGNIZE(
     ORDER BY data_timestamp, row
     MEASURES
              A.row != row AS is_duplicate -- our target value
     ALL ROWS PER MATCH -- take all rows
     AFTER MATCH SKIP PAST LAST ROW -- arguably self explanatory 
     PATTERN (A B*) -- pattern, A is the first row in the group, followed by zero or more records within the "window"
     DEFINE
              B AS data_timestamp <= FIRST(data_timestamp) + interval '1' hour
     )
order by row;

Output:

Row	data_timestamp	is_duplicate	target_is_duplicate
1	2024-09-30 15:55:50	false	false
2	2024-09-30 16:55:50	true	true
3	2024-09-30 17:36:50	false	false
4	2024-09-30 17:40:50	true	true
5	2024-09-30 17:50:03	true	true
6	2024-09-30 20:27:24	false	false
7	2024-09-30 21:27:24	true	true
8	2024-09-30 22:22:24	false	false