I have a status table like so, attributes not relevant to the question are omitted:
id | created | value |
---|---|---|
1 | 2024-06-24T13:01:00 | error |
2 | 2024-06-24T13:02:00 | ok |
3 | 2024-06-24T13:03:00 | warning |
4 | 2024-06-24T13:04:00 | error |
5 | 2024-06-24T13:05:00 | error |
6 | 2024-06-24T13:05:30 | error |
7 | 2024-06-24T13:06:00 | ok |
8 | 2024-06-24T13:07:00 | error |
9 | 2024-06-24T13:07:30 | error |
10 | 2024-06-24T13:08:00 | warning |
11 | 2024-06-24T13:09:00 | error |
I'd like to collapse the table into a bloc-like view, where the "error" blocs (1
, 4-6
, 8-9
, 11
) are collapsed into a single line, but with the respective before and after states and timestamps also included like so:
error_first_occurance | value_before | timestamp_before | value_after | timestamp_after |
---|---|---|---|---|
2024-06-24T13:01:00 | NULL | NULL | ok | 2024-06-24T13:02:00 |
2024-06-24T13:04:00 | warning | 2024-06-24T13:03:00 | ok | 2024-06-24T13:06:00 |
2024-06-24T13:07:00 | ok | 2024-06-24T13:06:00 | warning | 2024-06-24T13:08:00 |
2024-06-24T13:09:00 | warning | 2024-06-24T13:08:00 | NULL | NULL |
As far as I know, I'd have the following options:
1. Create sub queries
SELECT
value AS "value_before"
-- created AS "timestamp_before"
FROM t AS t1
WHERE t1.value != 'error' AND t1.created < t.created
ORDER BY t1.created DESC
LIMIT 1
SELECT
value AS "value_after"
-- created AS "timestamp_after"
FROM t AS t2
WHERE t2.value != 'error' AND t2.created > t.created
ORDER BY t2.created ASC
LIMIT 1
2. LATERAL JOIN
Using a lateral JOIN would presumably save half of the queries, as I could extract two fields at the same time (value
, created
) which is not possible with the sub queries.
However, the engine would likely need to execute the sorting again for each row that the main query produces. Therefore, I have not pursued this further.
3. SELF JOINs
Here, I'd create two derived tables t1
and t2
(CTE in this case would be the same thing), with sorting by created
DESC
(newest first) and ASC
(oldest first) and join them on t1.value != 'error' AND t1.created < t.created
for the "newest that is not an error and older that the t.created", and on t2.value != 'error' AND t2.created > t.created
for the "oldest that is not an error and younger that the t.created".
SELECT
t.created "error_first_occurance",
t1.value "value_before",
t1.created "timestamp_before",
t2.value "value_after",
t2.created "timestamp_after"
FROM t LEFT
JOIN (
SELECT value, created
FROM t WHERE value != 'error'
ORDER BY created DESC
) t1 ON t1.created < t.created LEFT
JOIN (
SELECT value, created
FROM t WHERE value != 'error'
ORDER BY created ASC
) t2 ON t2.created > t.created
WHERE
t.value = 'error'
ORDER BY
t.created
This already yields the correct "superset", but as I cannot rely on t1.value
or t2.value
having only the same values between blocs (see line 2
-3
), I don't know how to tell the DBMS that I want the MAX()
/MIN()
value from the created
timestamp, and the same value
from that record.
The challenge seems to be that I not only need the created
timestamp field, which could be easily extracted with aggregation functions, but I also need the string value
from the same record, which is not accessible to the aggregation functions.
Therefore, on a presumed real table with many 100ks of records:
That's a gaps-and-island problem. As already pointed out by @Knoep, you can use window functions:
select distinct on(island_n)
first_value(created) over w2 as error_first_occurance
,first_value(value_before) over w2 as value_before
,first_value(timestamp_before)over w2 as timestamp_before
,last_value(value_after) over w2 as value_after
,last_value(timestamp_after) over w2 as timestamp_after
from(select*,count(*)filter(where value<>'error')over w1 as island_n
,value='error'
and value is distinct from lag(value)over w1 as is_starting_island
,lag(value) over w1 as value_before
,lag(created) over w1 as timestamp_before
,lead(value) over w1 as value_after
,lead(created)over w1 as timestamp_after
from t window w1 as(order by created) )_
where value='error'
window w2 as(partition by island_n order by created
rows between unbounded preceding and unbounded following)
order by island_n,1;
It runs through your table finding islands (uninterrupted sequences of error
) and collects values from their preceding and following rows. After that, it takes one set of first/last values per island using distinct on
.
It's about as fast as the quadruple self-join
but it doesn't have to rely on the two quiet assumptions that the id
column has no gaps and that it follows the same order as created
- that one fails if any of the two doesn't hold. The window-over-window also doesn't require the additional effort of maintaining the pre-sorted, gapless id
.