sql left-join snowflake-cloud-data-platform attribution

LIMIT / Filtering on LEFT JOIN


I have two tables. One is a list of purchases with revenue, purchase_time and a user_id. The other is a list of campaign clicks with campaign_id, user_id and click_time. campaign_clicks essentially logs all clicks from a campaign: a user can have any number of clicks or none, and they can happen at any time, before or after a purchase. What I need to do is determine, for each purchase, which campaign_id was the last one the user clicked on before the purchase was made, and then the total revenue attributed to each campaign_id. I only want to attribute revenue to clicks that occurred within the 3 days prior to the purchase.

purchases

date        user_id  revenue  purchase_time
2020/09/01  10       30.0     2020/09/01 10:10:00 am
2020/09/01  20       15.0     2020/09/02 09:15:00 am
2020/09/01  30       25.0     2020/09/02 08:15:00 am

campaign_clicks

user_id  campaign_id  click_time
10       2            2020/09/01 10:01:00 am
10       1            2020/09/01 10:05:00 am
10       2            2020/09/01 10:20:00 am
20       2            2020/09/01 10:10:00 am
30       2            2020/09/01 07:30:00 am

desired result

date        campaign_id  revenue
2020/09/01  1            30.0
2020/09/01  2            25.0

The purchase from user id 20 shouldn't be included because it occurred before the click_time. User 10's revenue should be attributed to campaign 1 because that was the last click just before the purchase.

My problem is that the join I have returns all of a user's clicks, which inflates the revenue. The subquery in the join isn't aware of the purchase time, so I need to somehow narrow the clicks down to a single click: the last one before the purchase. I've tried using ROW_NUMBER() to apply an index, but on its own that doesn't let me filter out clicks that occur after the purchase.

This is where I’m at

SELECT
  date
  ,ROUND(SUM(revenue)) AS revenue
  ,clicks.campaign_id
FROM
  purchases
  LEFT JOIN (
    SELECT
      campaign_id
      ,user_id
      ,click_time
    FROM
      campaign_clicks
    ORDER BY
      click_time DESC
  ) AS clicks ON clicks.user_id = purchases.user_id
WHERE
  -- only select campaign clicks that occurred before the purchase
  purchases.purchase_time > clicks.click_time

  -- only include clicks that occurred within 3 days of the purchase
  AND DATEDIFF(minutes, clicks.click_time, purchases.purchase_time) <= (60*24*3)

  -- PROBLEM HERE - there can still be a number of other clicks that occurred before the purchase; I need to filter to only the last one
GROUP BY
  date
  ,clicks.campaign_id

Solution

  • Snowflake supports lateral joins, that is, joining to a table function or a correlated subquery. This lets you join to a query that returns just one row per input row.

    SELECT  
      purchases.date
     ,purchases.revenue
     ,clicks.campaign_id
    FROM 
      purchases    
    LEFT JOIN LATERAL
    (
      SELECT
        campaign_id 
       ,user_id
       ,click_time                       
      FROM 
        campaign_clicks
      WHERE
            user_id = purchases.user_id
        -- only select campaign clicks that occurred before the purchase                
        AND click_time <  purchases.purchase_time
        -- only include clicks that occurred within 3 days of the purchase               
        AND click_time >= DATEADD(days, -3, purchases.purchase_time)
      ORDER BY
        click_time DESC
      LIMIT
        1                        
    )
      AS clicks
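
    The lateral join above returns one row per purchase. If the goal is the summed result from the question, the same last-click rule can also be written with ROW_NUMBER() and Snowflake's QUALIFY clause. This is only a sketch and is not part of the original answer: it assumes the table and column names from the question, assumes that (user_id, purchase_time) identifies a single purchase because the tables show no purchase id, and uses an inner join so that purchases with no qualifying click drop out of the result.

```sql
-- Sketch only: table and column names are taken from the question.
-- Assumption: (user_id, purchase_time) uniquely identifies a purchase.
WITH attributed AS (
    SELECT
        p.date,
        p.revenue,
        c.campaign_id
    FROM purchases AS p
    INNER JOIN campaign_clicks AS c
        ON  c.user_id    = p.user_id
        -- only clicks that occurred before the purchase
        AND c.click_time <  p.purchase_time
        -- only clicks within the 3 days prior to the purchase
        AND c.click_time >= DATEADD(day, -3, p.purchase_time)
    -- keep only the most recent qualifying click for each purchase
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY p.user_id, p.purchase_time
        ORDER BY c.click_time DESC
    ) = 1
)
SELECT
    date,
    campaign_id,
    SUM(revenue) AS revenue
FROM attributed
GROUP BY date, campaign_id
ORDER BY date, campaign_id;
```

    Here the join conditions discard clicks that came after the purchase or more than 3 days before it, and only then does ROW_NUMBER() rank what is left. That ordering is what a ROW_NUMBER() computed over campaign_clicks alone cannot provide, since it has no purchase time to compare against.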