sqlpostgresqlsortinggaps-and-islandsself-join

How to performantly self-join same table multiple times with ascending and descending sort order?


Data Definition

I have a status table like so, attributes not relevant to the question are omitted:

id created value
1 2024-06-24T13:01:00 error
2 2024-06-24T13:02:00 ok
3 2024-06-24T13:03:00 warning
4 2024-06-24T13:04:00 error
5 2024-06-24T13:05:00 error
6 2024-06-24T13:05:30 error
7 2024-06-24T13:06:00 ok
8 2024-06-24T13:07:00 error
9 2024-06-24T13:07:30 error
10 2024-06-24T13:08:00 warning
11 2024-06-24T13:09:00 error

Task at Hand

I'd like to collapse the table into a bloc-like view, where the "error" blocs (1, 4-6, 8-9, 11) are collapsed into a single line, but with the respective before and after states and timestamps also included like so:

error_first_occurance value_before timestamp_before value_after timestamp_after
2024-06-24T13:01:00 NULL NULL ok 2024-06-24T13:02:00
2024-06-24T13:04:00 warning 2024-06-24T13:03:00 ok 2024-06-24T13:06:00
2024-06-24T13:07:00 ok 2024-06-24T13:06:00 warning 2024-06-24T13:08:00
2024-06-24T13:09:00 warning 2024-06-24T13:08:00 NULL NULL

Solution Options

As far as I know, I'd have the following options:

1. Create sub queries

SELECT 
  value AS "value_before"
  -- created AS "timestamp_before"
FROM t AS t1 
WHERE t1.value != 'error' AND t1.created < t.created 
ORDER BY t1.created DESC 
LIMIT 1
SELECT 
  value AS "value_after"
  -- created AS "timestamp_after"
FROM t AS t2 
WHERE t2.value != 'error' AND t2.created > t.created 
ORDER BY t2.created ASC
LIMIT 1

2. LATERAL JOIN

Using a lateral JOIN would presumably save half of the queries, as I could extract two fields at the same time (value, created) which is not possible with the sub queries.

However, the engine would likely need to execute the sorting again for each row that the main query produces. Therefore, I have not pursued this further.

3. SELF JOINs

Here, I'd create two derived tables t1 and t2 (CTE in this case would be the same thing), with sorting by created DESC (newest first) and ASC (oldest first) and join them on t1.value != 'error' AND t1.created < t.created for the "newest that is not an error and older that the t.created", and on t2.value != 'error' AND t2.created > t.created for the "oldest that is not an error and younger that the t.created".

SELECT
  t.created  "error_first_occurance",
  t1.value   "value_before",
  t1.created "timestamp_before",
  t2.value   "value_after",
  t2.created "timestamp_after"
FROM t LEFT
JOIN (
  SELECT value, created
  FROM t WHERE value != 'error'
  ORDER BY created DESC
) t1 ON t1.created < t.created LEFT
JOIN (
  SELECT value, created
  FROM t WHERE value != 'error'
  ORDER BY created ASC
) t2 ON t2.created > t.created
WHERE 
  t.value = 'error'
ORDER BY 
  t.created

This already yields the correct "superset", but as I cannot rely on t1.value or t2.value having only the same values between blocs (see line 2-3), I don't know how to tell the DBMS that I want the MAX()/MIN() value from the created timestamp, and the same value from that record.

Question

The challenge seems to be that I not only need the created timestamp field, which could be easily extracted with aggregation functions, but I also need the string value from the same record, which is not accessible to the aggregation functions.

Therefore, on a presumed real table with many 100ks of records:


Solution

  • That's a gaps-and-island problem. As already pointed out by @Knoep, you can use window functions:

    select distinct on(island_n)
           first_value(created)         over w2 as error_first_occurance
          ,first_value(value_before)    over w2 as value_before
          ,first_value(timestamp_before)over w2 as timestamp_before
          ,last_value(value_after)      over w2 as value_after
          ,last_value(timestamp_after)  over w2 as timestamp_after
    from(select*,count(*)filter(where value<>'error')over w1 as island_n 
                ,value='error'
                 and value is distinct from lag(value)over w1 as is_starting_island
                ,lag(value)   over w1 as value_before
                ,lag(created) over w1 as timestamp_before
                ,lead(value)  over w1 as value_after
                ,lead(created)over w1 as timestamp_after
         from t window w1 as(order by created) )_
    where value='error'
    window w2 as(partition by island_n order by created
                 rows between unbounded preceding and unbounded following)
    order by island_n,1;
    

    It runs through your table finding islands (uninterrupted sequences of error) and collects values from their preceding and following rows. After that, it takes one set of first/last values per island using distinct on.

    It's about as fast as the quadruple self-join but it doesn't have to rely on the two quiet assumptions that the id column has no gaps and that it follows the same order as created - that one fails if any of the two doesn't hold. The window-over-window also doesn't require the additional effort of maintaining the pre-sorted, gapless id.