sqlpostgresqlwindow-functionsgaps-and-islands

Postgres Select Row From Group With Opposite Sort Order


I have a Postgres table foobar with the following schema:

id: integer
group: integer
foo: integer
bar: integer
timestamp: integer

I keep track of updates to various groups, and those updates mutate the foo and bar properties. Every time I receive an update, I store the timestamp.

Here's an example value I could have in the databse:

+------+---------+-------+-------+-------------+
| "id" | "group" | "foo" | "bar" | "timestamp" |
+------+---------+-------+-------+-------------+
| 1    | 1       | 10    | 20    | 1           |
| 2    | 1       | 11    | 19    | 2           |
| 3    | 1       | 10    | 20    | 3           |
| 4    | 1       | 10    | 20    | 4           |
+------+---------+-------+-------+-------------+

Oftentimes, the updates I receive are identical. A particularly critical piece of information I'm trying to extract is when I first received the combination of values that is current — but not for the first time ever, but rather the first update after which there haven't been any changes.

A naïve approach would be the following query:

SELECT DISTINCT ON ("group", foo, bar) *
FROM foobar
ORDER BY "group", foo, bar, timestamp DESC;

However, that query would return the last row, which has the latest timestamp. If I switch timestamp to ASC, I would get the very first row, because I have seen the exact foo/bar value combination prior to the update at timestamp 2.

The intuitive thing would have been to simply move the timestamp DESC sort command prior to foo, but Postgres does not allow that. Something like MySQL's HAVING operation could also have come in handy, but Postgres unfortunately doesn't support that.

An incredibly inefficient approach I could take is programmatically iterate through each group, get the latest row, and then fetch all rows in descending timestamp order and stop as soon as I observe a change, but it seems that letting a database do this sort of operation would be wiser.

I am quite certain that I'm missing something obvious, but would greatly appreciate any help. Thanks!


Solution

  • That's a gaps-and-islands problem.
    You can compare each row's (foo,bar) to previous row's using lag(). The window definition lets you only check those coming from the same group, in ascending order.

    From that, you can get the latest "changing update" per group with a distinct on() or the 1=row_number()over w, as already described by @Stefanov.sm.
    Demo at db<>fiddle:

    select distinct on("group") id,"group",foo,bar,"timestamp"
    from (select *,coalesce(   foo<>lead(foo)over w1
                            or bar<>lead(bar)over w1,true) is_diff_from_prev
          from foobar
          window w1 as (partition by "group" order by "timestamp" desc))_
    where is_diff_from_prev
    order by "group","timestamp" desc;
    
    id group foo bar timestamp
    3 1 10 20 3

    For each group, this starts from the latest record and seeks the first one that changed either value. It takes around 3s on 200k rows with 4k groups, 7% of which have duplicate, non-changing updates among the most recent rows.