I've seen a lot of examples that group timestamps into fixed intervals (e.g. 0-10 mins, 11-20 mins). However, I want to ignore duplicate data in a database where rows have similar timestamps that could, in theory, span fixed-width intervals, e.g. 10:09 and 10:11. So I want the intervals to be dynamic and relative to each row.
What's the best approach to show all the "unique" entries, grabbing only the first entry where the timestamps are within +/- 5 minutes of the others?
Thanks
This is a gaps and islands problem. Let's start by creating a table of test values and populating it:
CREATE TABLE test_values AS (
  SELECT id, started_at::TIMESTAMP AS started_at
    FROM (VALUES (1, '2024-07-01 00:00:00'),
                 (2, '2024-07-01 00:10:00'),
                 (3, '2024-07-01 00:15:00'),
                 (4, '2024-07-01 00:19:00'),
                 (5, '2024-07-01 00:25:00')) v (id, started_at));
The following query demonstrates an approach to finding the starting times of each cluster:
WITH
  params AS (SELECT INTERVAL '5' MINUTE AS spread),
  ng AS (
    SELECT tv.id,
           tv.started_at,
           (tv.started_at - LAG(tv.started_at) OVER (ORDER BY tv.started_at) > params.spread) IS NOT FALSE AS new_group
      FROM params
     CROSS JOIN test_values tv)
SELECT ng.*
  FROM ng
 WHERE ng.new_group
 ORDER BY ng.started_at;
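The query works by identifying each started_at that neither overlaps nor is contiguous with the most recent earlier time plus spread, i.e. each time that is more than spread later than its predecessor. For the first row, LAG returns NULL and the comparison evaluates to NULL, so the comparison is checked with IS NOT FALSE so that the first time will be identified as starting a group: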
id | started_at | new_group |
---|---|---|
1 | 2024-07-01 00:00:00 | true |
2 | 2024-07-01 00:10:00 | true |
5 | 2024-07-01 00:25:00 | true |
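The effect of IS NOT FALSE can be checked in isolation. The following is a minimal sketch, assuming PostgreSQL's three-valued boolean logic; the column aliases are only illustrative names and are not part of the query above:

```sql
-- NULL stands in for the first row, where LAG has no earlier time to return.
SELECT (NULL::BOOLEAN) IS NOT FALSE                               AS first_row_case,         -- true
       (INTERVAL '10' MINUTE > INTERVAL '5' MINUTE) IS NOT FALSE  AS gap_wider_than_spread,  -- true
       (INTERVAL '4' MINUTE > INTERVAL '5' MINUTE) IS NOT FALSE   AS gap_within_spread;      -- false
```

Only a comparison that is definitely false keeps a row in the current cluster; both true and NULL start a new one.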
If > is changed to >=, then clusters will consist of times that are strictly less than spread away from adjacent times; i.e., adjacent times that are exactly spread apart will be in different clusters:
id | started_at | new_group |
---|---|---|
1 | 2024-07-01 00:00:00 | true |
2 | 2024-07-01 00:10:00 | true |
3 | 2024-07-01 00:15:00 | true |
5 | 2024-07-01 00:25:00 | true |
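If the members of each cluster are also of interest, a common continuation of the gaps-and-islands approach is to turn the new_group flag into a cluster number with a running sum. The following is a sketch under the same assumptions as above (PostgreSQL, the test_values table, and the > comparison); the names numbered, cluster_id, first_started_at, last_started_at, and entries are introduced here purely for illustration:

```sql
WITH
  params AS (SELECT INTERVAL '5' MINUTE AS spread),
  ng AS (
    SELECT tv.id,
           tv.started_at,
           (tv.started_at - LAG(tv.started_at) OVER (ORDER BY tv.started_at) > params.spread) IS NOT FALSE AS new_group
      FROM params
     CROSS JOIN test_values tv),
  numbered AS (
    SELECT ng.id,
           ng.started_at,
           -- Each group start adds 1, so the running total serves as a cluster number.
           SUM(ng.new_group::INT) OVER (ORDER BY ng.started_at) AS cluster_id
      FROM ng)
SELECT numbered.cluster_id,
       MIN(numbered.started_at) AS first_started_at,
       MAX(numbered.started_at) AS last_started_at,
       COUNT(*) AS entries
  FROM numbered
 GROUP BY numbered.cluster_id
 ORDER BY numbered.cluster_id;
```

With the test data this should give three clusters: 00:00 alone, 00:10 through 00:19 with three entries, and 00:25 alone. Note that clusters formed this way are chained: each time is compared with its immediate predecessor, not with the first time in the cluster, so a cluster can span more than spread in total.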