I’d like to set is_duplicate
to TRUE
for records that occur within a 1-hour window of an earlier record.
The rule is that each record should check against the most recent prior record where is_duplicate
is FALSE
, not just the immediately preceding row.
For example, in row 3, the timestamp should be compared with row 1 (the last non duplicate) rather than row 2, which is already marked as a duplicate.
I know that this might need to be done in a post processing job, but is there any way to do this purely with SQL?
I'm running these queries in Athena.
Row | data_timestamp | is_duplicate | explanation |
---|---|---|---|
1 | 2024-09-30 15:55:50 | FALSE | there's no preceding logging |
2 | 2024-09-30 16:55:50 | TRUE | diff with row 1 is less than or equal to 1 hour |
3 | 2024-09-30 17:36:50 | FALSE | diff with row 1 is larger than 1 hour |
4 | 2024-09-30 17:40:50 | TRUE | diff with row 3 is less than or equal to 1 hour |
5 | 2024-09-30 17:50:03 | TRUE | diff with row 3 is less than or equal to 1 hour |
6 | 2024-09-30 20:27:24 | FALSE | diff with row 3 is larger than 1 hour |
7 | 2024-09-30 21:27:24 | TRUE | diff with row 6 is less than or equal to 1 hour |
8 | 2024-09-30 22:22:24 | FALSE | diff with row 6 is larger than 1 hour |
Using window functions on Athena didn't work since it still doesn't know which row is the most recent prior record while calculating is_duplicate
I’d like to set
is_duplicate
toTRUE
for records that occur within a 1-hour window of an earlier record.
As far as I understand this can be rephrased as "starting the first record all the following within an 1 hour window should be marked as duplicates, then skip to the next following record", which is what can be "easily" done with the MATCH_RECOGNIZE
function (AFAIK Athena should support it) :
-- sample data
WITH dataset(Row, data_timestamp, target_is_duplicate) as (
values (1, timestamp '2024-09-30 15:55:50', FALSE),
(2, timestamp '2024-09-30 16:55:50', TRUE),
(3, timestamp '2024-09-30 17:36:50', FALSE),
(4, timestamp '2024-09-30 17:40:50', TRUE),
(5, timestamp '2024-09-30 17:50:03', TRUE),
(6, timestamp '2024-09-30 20:27:24', FALSE),
(7, timestamp '2024-09-30 21:27:24', TRUE),
(8, timestamp '2024-09-30 22:22:24', FALSE)
)
-- query
SELECT * FROM dataset MATCH_RECOGNIZE(
ORDER BY data_timestamp, row
MEASURES
A.row != row AS is_duplicate -- our target value
ALL ROWS PER MATCH -- take all rows
AFTER MATCH SKIP PAST LAST ROW -- arguably self explanatory
PATTERN (A B*) -- pattern, A is the first row in the group, followed by zero or more records within the "window"
DEFINE
B AS data_timestamp <= FIRST(data_timestamp) + interval '1' hour
)
order by row;
Output:
Row | data_timestamp | is_duplicate | target_is_duplicate |
---|---|---|---|
1 | 2024-09-30 15:55:50 | false | false |
2 | 2024-09-30 16:55:50 | true | true |
3 | 2024-09-30 17:36:50 | false | false |
4 | 2024-09-30 17:40:50 | true | true |
5 | 2024-09-30 17:50:03 | true | true |
6 | 2024-09-30 20:27:24 | false | false |
7 | 2024-09-30 21:27:24 | true | true |
8 | 2024-09-30 22:22:24 | false | false |