To understand a business process with several statuses,
I want to get the rows with the following rules based on created_at
column:
First row of status ‘created’
Last row of ‘missing_info’
after ‘created’
(row_no 4)
First row of ‘pending’
(row_no 5)
Last row of ‘missing_info’
after ‘pending’
(row_no 7)
First row of ‘pending’
after 'missing_info'
(row_no 8)
Last row of ‘successful’
(row_no 10)
Below I highlighted the rows I want to retrieve.
Here is the example data on DB-FIDDLE
Here is the general flow: created > missing_info > pending > successful. But can be only like this too: created > successful.
I know that I can use QUALIFY
with window functions and can get 'created'
and 'successful'
as below. But I don’t know how to get interim statuses. How can I achieve the desired output?
created AS(
SELECT *
FROM t1
WHERE status = 'created'
QUALIFY ROW_NUMBER() OVER (PARTITION BY STATUS, id ORDER BY created_at) = 1 )
Please note that created
and successful
are start and end statuses, so there will be only one row in the output. Others like missing_info
or pending
are interim statuses, so can be multiple of them in the desired output.
To understand a business process with several statuses,
I want to get some of the rows with the following two rules based on created_at
column:
last row of status ‘missing_info’ before ‘pending’ (row 2)
pending (row 3)
last row of status ‘missing_info’ before ‘pending’ (row 5)
pending (row 6)
Example data:
WITH t1 AS (
SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 11:10:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 11:20:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'pending' AS status, '2021-07-15 11:30:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 12:10:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 12:20:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'pending' AS status, '2021-07-15 12:30:00'::timestamp AS created_at
)
SELECT *
FROM t1
Desired output:
Snowflake implements MATCH_RECOGNIZE, which is the simplest tool for finding complex patterns in pure SQL:
Recognizes matches of a pattern in a set of rows. MATCH_RECOGNIZE accepts a set of rows (from a table, view, subquery, or other source) as input, and returns all matches for a given row pattern within this set. The pattern is defined similarly to a regular expression.
Data preparation:
CREATE OR REPLACE TABLE t
AS
WITH t1 AS (
SELECT 'A' AS id, 'created' AS status, '2021-07-15 10:30:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'created' AS status, '2021-07-15 10:38:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 11:10:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 11:12:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'pending' AS status, '2021-07-15 12:05:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 13:36:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 14:36:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'pending' AS status, '2021-07-15 12:05:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'successful' AS status, '2021-07-15 16:05:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'successful' AS status, '2021-07-15 17:00:00'::timestamp AS created_at UNION ALL
SELECT 'B' AS id, 'created' AS status, '2021-07-16 10:30:00'::timestamp AS created_at UNION ALL
SELECT 'B' AS id, 'created' AS status, '2021-07-16 11:30:00'::timestamp AS created_at UNION ALL
SELECT 'B' AS id, 'successful' AS status, '2021-07-16 12:30:00'::timestamp AS created_at
)
SELECT * FROM t1;
Query for scenario 1:
SELECT *
FROM t
MATCH_RECOGNIZE (
PARTITION BY ID
ORDER BY CREATED_AT
-- MEASURES MATCH_NUMBER() AS m, --LAST/FIRST/CLASSIFIER/...
ALL ROWS PER MATCH
PATTERN (c+m+)
DEFINE
c AS status='created'
,m AS status='missing_info'
,p AS status='pending'
,s AS status='succesful'
) mr
ORDER BY ID, CREATED_AT;
-- returns rows 1-4
Key point here is pattern which is provided as Perl-style regular expression. Here we are searching for pattern of one or more "create" finished by one or more "missing_info".
ALL ROWS PER MATCH
- return all rows but it could be changed to first row if necessary
MEASURES: Specifying Additional Output Columns could be used to provide additional info like MATCH_NUMBER/MATCH_SEQUENCE_NUMBER/CLASSIFIER and more depending of specific needs.
More patterns in single query could be provided by using '|' (alternative): (c+m+|pm+|...)
EDIT:
"Thanks for the answer! It returns first 4 rows. I was essentially needed 1st and 4th row."
Once groups are identified, filtering first and last row could be achieved for instance with QUALIFY
. The key is to use MEASURES that I mentioned before:
SELECT *
FROM t
MATCH_RECOGNIZE (
PARTITION BY ID
ORDER BY CREATED_AT
MEASURES MATCH_NUMBER() AS mn,
MATCH_SEQUENCE_NUMBER AS msn
ALL ROWS PER MATCH
PATTERN (c+m+)
DEFINE
c AS status='created'
,m AS status='missing_info'
,p AS status='pending'
,s AS status='succesful'
) mr
QUALIFY (ROW_NUMBER() OVER(PARTITION BY mn, ID ORDER BY msn) = 1)
OR(ROW_NUMBER() OVER(PARTITION BY mn, ID ORDER BY msn DESC)=1)
ORDER BY ID, CREATED_AT;
-- returns first and last row by group consisted of ID and MATCH_NUMBER