sqlsnowflake-cloud-data-platformwindow-functionsmatch-recognize

How to get specific rows for each id?


To understand a business process with several statuses,

I want to get the rows with the following rules based on created_at column:


Below I highlighted the rows I want to retrieve.


enter image description here

Here is the example data on DB-FIDDLE

Here is the general flow: created > missing_info > pending > successful. But can be only like this too: created > successful.

I know that I can use QUALIFY with window functions and can get 'created' and 'successful' as below. But I don’t know how to get interim statuses. How can I achieve the desired output?

created AS(

SELECT *

FROM t1

WHERE status = 'created'

QUALIFY ROW_NUMBER() OVER (PARTITION BY STATUS, id ORDER BY created_at) = 1 )

Please note that created and successful are start and end statuses, so there will be only one row in the output. Others like missing_info or pending are interim statuses, so can be multiple of them in the desired output.


EDIT:

To understand a business process with several statuses,

I want to get some of the rows with the following two rules based on created_at column:


enter image description here

Example data:

WITH t1 AS (

SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 11:10:00'::timestamp AS created_at UNION ALL

SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 11:20:00'::timestamp AS created_at UNION ALL

SELECT 'A' AS id, 'pending' AS status, '2021-07-15 11:30:00'::timestamp AS created_at UNION ALL

SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 12:10:00'::timestamp AS created_at UNION ALL

SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 12:20:00'::timestamp AS created_at UNION ALL

SELECT 'A' AS id, 'pending' AS status, '2021-07-15 12:30:00'::timestamp AS created_at

    )

SELECT *

FROM t1

Desired output:

enter image description here


Solution

  • Snowflake implements MATCH_RECOGNIZE, which is the simplest tool for finding complex patterns in pure SQL:

    Recognizes matches of a pattern in a set of rows. MATCH_RECOGNIZE accepts a set of rows (from a table, view, subquery, or other source) as input, and returns all matches for a given row pattern within this set. The pattern is defined similarly to a regular expression.

    Data preparation:

    CREATE OR REPLACE TABLE t
    AS
    WITH t1 AS (
    SELECT 'A' AS id, 'created' AS status, '2021-07-15 10:30:00'::timestamp AS created_at UNION ALL
    SELECT 'A' AS id, 'created' AS status, '2021-07-15 10:38:00'::timestamp AS created_at UNION ALL
    SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 11:10:00'::timestamp AS created_at UNION ALL
    SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 11:12:00'::timestamp AS created_at UNION ALL
    SELECT 'A' AS id, 'pending' AS status, '2021-07-15 12:05:00'::timestamp AS created_at UNION ALL
    SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 13:36:00'::timestamp AS created_at UNION ALL
    SELECT 'A' AS id, 'missing_info' AS status, '2021-07-15 14:36:00'::timestamp AS created_at UNION ALL
    SELECT 'A' AS id, 'pending' AS status, '2021-07-15 12:05:00'::timestamp AS created_at UNION ALL
    SELECT 'A' AS id, 'successful' AS status, '2021-07-15 16:05:00'::timestamp AS created_at UNION ALL
    SELECT 'A' AS id, 'successful' AS status, '2021-07-15 17:00:00'::timestamp AS created_at UNION ALL
    SELECT 'B' AS id, 'created' AS status, '2021-07-16 10:30:00'::timestamp AS created_at UNION ALL
    SELECT 'B' AS id, 'created' AS status, '2021-07-16 11:30:00'::timestamp AS created_at UNION ALL
    SELECT 'B' AS id, 'successful' AS status, '2021-07-16 12:30:00'::timestamp AS created_at
        )     
    SELECT * FROM t1;
    

    Query for scenario 1:

    SELECT *
    FROM t
    MATCH_RECOGNIZE (
      PARTITION BY ID
      ORDER BY CREATED_AT
      -- MEASURES MATCH_NUMBER() AS m, --LAST/FIRST/CLASSIFIER/...
      ALL ROWS PER MATCH
      PATTERN (c+m+)
      DEFINE
         c AS status='created'
        ,m AS status='missing_info'
        ,p AS status='pending'
        ,s AS status='succesful'
    ) mr
    ORDER BY ID, CREATED_AT;
    -- returns rows 1-4
    

    Key point here is pattern which is provided as Perl-style regular expression. Here we are searching for pattern of one or more "create" finished by one or more "missing_info".

    ALL ROWS PER MATCH - return all rows but it could be changed to first row if necessary

    MEASURES: Specifying Additional Output Columns could be used to provide additional info like MATCH_NUMBER/MATCH_SEQUENCE_NUMBER/CLASSIFIER and more depending of specific needs.

    More patterns in single query could be provided by using '|' (alternative): (c+m+|pm+|...)


    EDIT:

    "Thanks for the answer! It returns first 4 rows. I was essentially needed 1st and 4th row."

    Once groups are identified, filtering first and last row could be achieved for instance with QUALIFY. The key is to use MEASURES that I mentioned before:

    SELECT *
        FROM t
        MATCH_RECOGNIZE (
          PARTITION BY ID
          ORDER BY CREATED_AT
          MEASURES MATCH_NUMBER() AS mn,
                   MATCH_SEQUENCE_NUMBER AS msn
          ALL ROWS PER MATCH
          PATTERN (c+m+)
          DEFINE
             c AS status='created'
            ,m AS status='missing_info'
            ,p AS status='pending'
            ,s AS status='succesful'
        ) mr
        QUALIFY (ROW_NUMBER() OVER(PARTITION BY mn, ID ORDER BY msn) = 1)
              OR(ROW_NUMBER() OVER(PARTITION BY mn, ID ORDER BY msn DESC)=1)
        ORDER BY ID, CREATED_AT;
        -- returns first and last row by group consisted of ID and MATCH_NUMBER