Using Postgres 11.6, I'm trying to write a query that filters out events based on their position in a sequence during grouping. We've got a lot of event data grouped in "cycles". Here's the table setup:
Here's the table setup:
CREATE TABLE IF NOT EXISTS data.event (
id uuid NOT NULL,
cycle_id uuid,
dts timestamp without time zone,
state citext,
PRIMARY KEY(id)
);
Below are some sample records from three different cycles:
id cycle_id dts state
f5c1ecef-b917-6c44-ae18-ed8c8b9e71d5 00000910-da83-7c4b-b47a-38be998471d1 2019-05-15 15:54:17 Boxed
dbf713c9-802d-a445-9cf8-ff2953d1ef6c 00000910-da83-7c4b-b47a-38be998471d1 2019-05-16 09:08:23 Assembly
93ea8e6a-2c41-d848-805b-dc83d572af1e 00000910-da83-7c4b-b47a-38be998471d1 2019-05-16 09:34:51 Waiting For Quality
bc30d154-8c72-7e40-b75d-048b7f2ca3d4 00000910-da83-7c4b-b47a-38be998471d1 2019-05-16 10:43:31 Quality
0b69895a-36fd-274b-8730-e81206636013 00000910-da83-7c4b-b47a-38be998471d1 2019-05-16 13:02:07 Waiting After Quality
550a0b9e-7731-4347-ba55-4ca222b3e69f 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-15 13:39:59 Boxed
1c43d256-807c-8643-82a8-1c32653cedb3 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-15 19:14:36 Waiting For Assembly
9b4e8e70-d383-4b47-aed4-4aa9de3b5b5d 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-15 20:13:59 Assembly
1d99218c-ea25-6c4d-91a5-07d5198971f3 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-15 20:16:11 Waiting For Quality
b9571373-4de1-cc4c-badc-58cbaca2de1a 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-15 21:36:59 Quality
1d233ed1-ea57-f741-af76-cf2a8630f5d7 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-15 23:40:04 Waiting After Quality
db305c40-485b-b542-8c05-739ea46fa006 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-16 22:15:13 Storage
2afa5d2c-f061-4c47-be85-010010a12785 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-16 23:01:28 Vendor Dropoff
43a35799-d58e-5049-b9ca-12e0980407b1 00010d69-51d0-8b4f-9acb-1573a97517cd 2019-08-08 14:09:56 Boxed
8db57a1a-6f02-7c46-81ca-f1e6acb464cd 00010d69-51d0-8b4f-9acb-1573a97517cd 2019-08-10 12:18:02 Assembly
74d53a8d-5e22-274d-b429-b3bfa89b0211 00010d69-51d0-8b4f-9acb-1573a97517cd 2019-08-10 12:18:25 Waiting For Quality
8517b241-1a2f-2c47-80f5-f910a7b0729c 00010d69-51d0-8b4f-9acb-1573a97517cd 2019-08-10 12:25:14 Quality
ebaaf219-7319-084d-9179-ec7b1f4bffdf 00010d69-51d0-8b4f-9acb-1573a97517cd 2019-08-10 14:53:42 Waiting After Quality
3ee6ae5b-9e44-4944-9aad-b084552309af 00010d69-51d0-8b4f-9acb-1573a97517cd 2019-08-14 00:08:19 Storage
cea8cfdb-533d-5e42-b136-b9bfd5d69eee 00010d69-51d0-8b4f-9acb-1573a97517cd 2019-08-14 23:00:00 Transport
b27e43da-df4f-f642-8158-25efa2f841a4 00010d69-51d0-8b4f-9acb-1573a97517cd 2019-08-14 14:09:17 Return to Vendor
In reality, the cycles include more steps, I've only included enough to make an example. The goal here is to filter out events after specific cycle phases, "Waiting After Quality" or "Storage". Every cycle has one or both of these events. Written imperatively, the logic for each cycle would be like this:
If (The cycle includes "Storage") then
Keep only the cycle data up to and include "Storage"
Else
Keep only the cycle data up to and including "Waiting After Quality"
End if
The target result includes only these rows:
id cycle_id dts state
f5c1ecef-b917-6c44-ae18-ed8c8b9e71d5 00000910-da83-7c4b-b47a-38be998471d1 2019-05-15 15:54:17 Boxed
dbf713c9-802d-a445-9cf8-ff2953d1ef6c 00000910-da83-7c4b-b47a-38be998471d1 2019-05-16 09:08:23 Assembly
93ea8e6a-2c41-d848-805b-dc83d572af1e 00000910-da83-7c4b-b47a-38be998471d1 2019-05-16 09:34:51 Waiting For Quality
bc30d154-8c72-7e40-b75d-048b7f2ca3d4 00000910-da83-7c4b-b47a-38be998471d1 2019-05-16 10:43:31 Quality
0b69895a-36fd-274b-8730-e81206636013 00000910-da83-7c4b-b47a-38be998471d1 2019-05-16 13:02:07 Waiting After Quality
550a0b9e-7731-4347-ba55-4ca222b3e69f 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-15 13:39:59 Boxed
1c43d256-807c-8643-82a8-1c32653cedb3 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-15 19:14:36 Waiting For Assembly
9b4e8e70-d383-4b47-aed4-4aa9de3b5b5d 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-15 20:13:59 Assembly
1d99218c-ea25-6c4d-91a5-07d5198971f3 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-15 20:16:11 Waiting For Quality
b9571373-4de1-cc4c-badc-58cbaca2de1a 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-15 21:36:59 Quality
1d233ed1-ea57-f741-af76-cf2a8630f5d7 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-15 23:40:04 Waiting After Quality
db305c40-485b-b542-8c05-739ea46fa006 0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b 2019-04-16 22:15:13 Storage
43a35799-d58e-5049-b9ca-12e0980407b1 00010d69-51d0-8b4f-9acb-1573a97517cd 2019-08-08 14:09:56 Boxed
8db57a1a-6f02-7c46-81ca-f1e6acb464cd 00010d69-51d0-8b4f-9acb-1573a97517cd 2019-08-10 12:18:02 Assembly
74d53a8d-5e22-274d-b429-b3bfa89b0211 00010d69-51d0-8b4f-9acb-1573a97517cd 2019-08-10 12:18:25 Waiting For Quality
8517b241-1a2f-2c47-80f5-f910a7b0729c 00010d69-51d0-8b4f-9acb-1573a97517cd 2019-08-10 12:25:14 Quality
ebaaf219-7319-084d-9179-ec7b1f4bffdf 00010d69-51d0-8b4f-9acb-1573a97517cd 2019-08-10 14:53:42 Waiting After Quality
3ee6ae5b-9e44-4944-9aad-b084552309af 00010d69-51d0-8b4f-9acb-1573a97517cd 2019-08-14 00:08:19 Storage
A simple grouping query lays out the cycle events in time order:
select cycle_id,
string_agg(state::text, ', ' order by dts asc) as states
from event
group by cycle_id
cycle_id states
00000910-da83-7c4b-b47a-38be998471d1 Boxed, Assembly, Waiting For Quality, Quality, Waiting After Quality
0000cf14-e5d3-4c4c-9e44-b39bc5f7fa7b Boxed, Waiting For Assembly, Assembly, Waiting For Quality, Quality, Waiting After Quality, Storage, Vendor Dropoff
00010d69-51d0-8b4f-9acb-1573a97517cd Boxed, Assembly, Waiting For Quality, Quality, Waiting After Quality, Storage, Return to Vendor, Transport
I need to find the events from the start of each cycle until and including "Storage" or "Waiting After Quality", with "Storage" in case both are present.
And I'm stumped. I understand how to write this in a client-side language, but not in SQL. I'd much prefer to do it in SQL. If someone can show me how, or point me in the right direction, that would be fantastic.
Here's a final solution with a couple of small syntax changes from the accepted answer.
SELECT event.*
FROM data.event
JOIN (
SELECT cycle_id,
COALESCE(
MIN(dts) FILTER (WHERE state = 'Storage'),
MIN(dts) FILTER (WHERE state = 'Waiting After Quality')) as upto
FROM data.event
GROUP BY cycle_id
) last_events USING (cycle_id)
WHERE dts <= upto
ORDER BY cycle_id,
dts
I'd suggest first finding the timestamps per cycle up-to-and-including which you want to select the events:
SELECT
cycle_id,
COALESCE(
MIN(dts) FILTER (WHERE state = "Storage"),
MIN(dts) FILTER (WHERE state = "Waiting After Quality")
) as upto
FROM data.event
GROUP BY cycle_id
Then you can join these results against all the events, and filter them by the respective condition:
SELECT event.*
FROM data.event
JOIN (
SELECT
cycle_id,
COALESCE(
MIN(dts) FILTER (WHERE state = "Storage"),
MIN(dts) FILTER (WHERE state = "Waiting After Quality")
) as upto
FROM data.event
GROUP BY cycle_id
) last_events USING (cycle_id)
WHERE dts <= upto