I have a Postgres table foobar
with the following schema:
id: integer
group: integer
foo: integer
bar: integer
timestamp: integer
I keep track of updates to various groups, and those updates mutate the foo
and bar
properties. Every time I receive an update, I store the timestamp.
Here's an example value I could have in the databse:
+------+---------+-------+-------+-------------+
| "id" | "group" | "foo" | "bar" | "timestamp" |
+------+---------+-------+-------+-------------+
| 1 | 1 | 10 | 20 | 1 |
| 2 | 1 | 11 | 19 | 2 |
| 3 | 1 | 10 | 20 | 3 |
| 4 | 1 | 10 | 20 | 4 |
+------+---------+-------+-------+-------------+
Oftentimes, the updates I receive are identical. A particularly critical piece of information I'm trying to extract is when I first received the combination of values that is current — but not for the first time ever, but rather the first update after which there haven't been any changes.
A naïve approach would be the following query:
SELECT DISTINCT ON ("group", foo, bar) *
FROM foobar
ORDER BY "group", foo, bar, timestamp DESC;
However, that query would return the last row, which has the latest timestamp. If I switch timestamp to ASC
, I would get the very first row, because I have seen the exact foo/bar value combination prior to the update at timestamp 2.
The intuitive thing would have been to simply move the timestamp DESC
sort command prior to foo
, but Postgres does not allow that. Something like MySQL's HAVING
operation could also have come in handy, but Postgres unfortunately doesn't support that.
An incredibly inefficient approach I could take is programmatically iterate through each group, get the latest row, and then fetch all rows in descending timestamp order and stop as soon as I observe a change, but it seems that letting a database do this sort of operation would be wiser.
I am quite certain that I'm missing something obvious, but would greatly appreciate any help. Thanks!
That's a gaps-and-islands problem.
You can compare each row's (foo,bar)
to previous row's using lag()
. The window
definition lets you only check those coming from the same group, in ascending order.
From that, you can get the latest "changing update" per group with a distinct on()
or the 1=row_number()over w
, as already described by @Stefanov.sm.
Demo at db<>fiddle:
select distinct on("group") id,"group",foo,bar,"timestamp"
from (select *,coalesce( foo<>lead(foo)over w1
or bar<>lead(bar)over w1,true) is_diff_from_prev
from foobar
window w1 as (partition by "group" order by "timestamp" desc))_
where is_diff_from_prev
order by "group","timestamp" desc;
id | group | foo | bar | timestamp |
---|---|---|---|---|
3 | 1 | 10 | 20 | 3 |
For each group, this starts from the latest record and seeks the first one that changed either value. It takes around 3s on 200k rows with 4k groups, 7% of which have duplicate, non-changing updates among the most recent rows.