sqlpattern-matchingverticaweb-analytics

Issue identifying pattern with vertica match clause


I'm having some difficulty understanding how to leverage Vertica's match clause to identify sessions in which a user searched for something on our site (event_category ='Search') and then saw a product carousel item (product_list ='banner' AND event_action ='impression').

Varying events are captured before, after, and during the pattern I'd like to identify, as the number of products that appear on a page and a user's engagement with our site vary can from session to session and user to user.

Raw Data Example

| hit_number | product_list         | Event_Category | Event_Action | Event_Label   |
|------------|----------------------|----------------|--------------|---------------|
| 105        | (null)               | Search         | Submit       | chocolate     |
| 106        | (null)               | eec            | impression   | search-result |
| 107        | search-result        | eec            | impression   | sendData      |
| 107        | search-result        | eec            | impression   | sendData      |
| 107        | search-result        | eec            | impression   | sendData      |
| 107        | search-result        | eec            | impression   | sendData      |
| 108        | (null)               | (null)         | (null)       | (null)        |
| 109        | (null)               | eec            | impression   | banner        |
| 110        | banner-105-chocolate | eec            | impression   | sendData      |
| 110        | banner-105-chocolate | eec            | impression   | sendData      |
| 110        | banner-105-chocolate | eec            | impression   | sendData      |

For the pattern to be valid, there must be at least 1 search event and 1 banner impression, I've set the pattern to (Search+ Banner+) to reflect this, but I'm not returning any results when I run execute the SQL query shown below.

SELECT
page_title
,event_label
,event_name()
,match_id()
,pattern_id()

FROM 
        (SELECT
                unique_visit_id
                ,hit_number
                ,event_category
                ,event_label
                ,event_action
                ,product_list

        FROM 
                atomic.ga_sessions_hits_product_expanded
        WHERE
                1=1
                AND ga_sessions_date >= CURRENT_DATE -3
                AND unique_visit_id = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
        ORDER BY 
                hit_number ASC) base

Match
    (Partition by unique_visit_id Order by hit_number
     Define
            Search as event_category ='Search' and event_action = 'Submit',
            Banner as product_list ilike 'banner-%' and event_action ='impression'
     Pattern
            P as (Search+ BannerImpression+)
    ROWS MATCH FIRST EVENT)

Please let me know if there's anything I should clarify, any insights or assistance would be greatly appreciated!


Solution

  • First, the column you're partitioning by is not in the example input. I added it and gave it the value 42 for all rows in your input data.

    Your problem is that there are no patterns, in that data snippet, where an event that you named banner immediately follows an event that you named search

    I added yet another event into the DEFINE clause, at the end. If the other two don't evaluate to true, the last, which is just defined as other AS true, will be picked (that's the behaviour of ROWS MATCH FIRST EVENT) .

    And the pattern then becomes (search+ other* banner+), and that one is then found.

    See here:

    WITH
    ga_sessions_hits_product_expanded(
       unique_visit_id,hit_number,product_list,Event_Category,Event_Action,Event_Label
    ) AS (
              SELECT 42,105,NULL,'Search','Submit','chocolate'
    UNION ALL SELECT 42,106,NULL,'eec','impression','search-result'
    UNION ALL SELECT 42,107,'search-result','eec','impression','sendData'
    UNION ALL SELECT 42,107,'search-result','eec','impression','sendData'
    UNION ALL SELECT 42,107,'search-result','eec','impression','sendData'
    UNION ALL SELECT 42,107,'search-result','eec','impression','sendData'
    UNION ALL SELECT 42,108,NULL,NULL,NULL,NULL
    UNION ALL SELECT 42,109,NULL,'eec','impression','banner'
    UNION ALL SELECT 42,110,'banner-105-chocolate','eec','impression','sendData'
    UNION ALL SELECT 42,110,'banner-105-chocolate','eec','impression','sendData'
    UNION ALL SELECT 42,110,'banner-105-chocolate','eec','impression','sendData'
    )
    SELECT
      *
    , event_name()
    , pattern_id()
    , match_id()
    FROM ga_sessions_hits_product_expanded
    MATCH(
      PARTITION BY unique_visit_id ORDER BY hit_number
      DEFINE
        search AS event_category='Search' AND event_action='Submit'
      , banner AS product_list ILIKE 'banner-%' AND event_action='impression'
      , other AS true
      PATTERN p AS (search+ other* banner+)
      ROWS MATCH FIRST EVENT
    );
    -- out Null display is "NULL".
    -- out  unique_visit_id | hit_number |     product_list     | Event_Category | Event_Action |  Event_Label  | event_name | pattern_id | match_id 
    -- out -----------------+------------+----------------------+----------------+--------------+---------------+------------+------------+----------
    -- out               42 |        105 | NULL                 | Search         | Submit       | chocolate     | search     |          1 |        1
    -- out               42 |        106 | NULL                 | eec            | impression   | search-result | other      |          1 |        2
    -- out               42 |        107 | search-result        | eec            | impression   | sendData      | other      |          1 |        3
    -- out               42 |        107 | search-result        | eec            | impression   | sendData      | other      |          1 |        4
    -- out               42 |        107 | search-result        | eec            | impression   | sendData      | other      |          1 |        5
    -- out               42 |        107 | search-result        | eec            | impression   | sendData      | other      |          1 |        6
    -- out               42 |        108 | NULL                 | NULL           | NULL         | NULL          | other      |          1 |        7
    -- out               42 |        109 | NULL                 | eec            | impression   | banner        | other      |          1 |        8
    -- out               42 |        110 | banner-105-chocolate | eec            | impression   | sendData      | banner     |          1 |        9
    -- out               42 |        110 | banner-105-chocolate | eec            | impression   | sendData      | banner     |          1 |       10
    -- out               42 |        110 | banner-105-chocolate | eec            | impression   | sendData      | banner     |          1 |       11
    -- out (11 rows)
    -- out 
    -- out Time: First fetch (11 rows): 50.632 ms. All rows formatted: 50.721 ms