I have two tables. purchases is a list of purchases with revenue, purchase_time and a user_id; campaign_clicks is a list of campaign clicks with campaign_id, user_id and click_time. campaign_clicks essentially logs all clicks from a campaign: there can be any number of clicks, or none, and they can happen at any time, before or after a purchase. What I need to do is determine, for each purchase, which campaign_id was the last campaign the user clicked on before the purchase was made, and then the total revenue attributed to each campaign_id. I only want to attribute revenue to clicks that occurred within 3 days prior to the purchase.
purchases
date | user_id | revenue | purchase_time |
---|---|---|---|
2020/09/01 | 10 | 30.0 | 2020/09/01 10:10:00 am |
2020/09/01 | 20 | 15.0 | 2020/09/01 09:15:00 am |
2020/09/01 | 30 | 25.0 | 2020/09/01 08:15:00 am |
campaign_clicks
user_id | campaign_id | click_time |
---|---|---|
10 | 2 | 2020/09/01 10:01:00 am |
10 | 1 | 2020/09/01 10:05:00 am |
10 | 2 | 2020/09/01 10:20:00 am |
20 | 2 | 2020/09/01 10:10:00 am |
30 | 2 | 2020/09/01 07:30:00 am |
desired result
date | campaign_id | revenue |
---|---|---|
2020/09/01 | 1 | 30.0 |
2020/09/01 | 2 | 25.0 |
The purchase from user id 20 shouldn't be included because it occurred before the click_time. User 10's revenue should be attributed to campaign 1 because that click (10:05) was the last one before the purchase (10:10); the campaign 2 click at 10:20 happened after the purchase.
My problem is that the join returns all the matching clicks, which inflates the revenue. The subquery in the join isn't aware of the purchase time, so I need to somehow narrow the clicks down to a single click per purchase: the last one. I've tried using ROW_NUMBER() to apply an index, but that doesn't allow me to filter out clicks that occur after the purchase.
This is where I'm at:

    SELECT
        date
        ,ROUND(SUM(revenue)) AS revenue
        ,clicks.campaign_id
    FROM
        purchases
    LEFT JOIN (
        SELECT
            campaign_id
            ,user_id
            ,click_time
        FROM
            campaign_clicks
        ORDER BY
            click_time DESC
    ) AS clicks ON clicks.user_id = purchases.user_id
    WHERE
        -- only select campaign clicks that occurred before the purchase
        purchases.purchase_time > clicks.click_time
        -- only include clicks that occurred within 3 days of the purchase
        AND DATEDIFF(minutes, clicks.click_time, purchases.purchase_time) <= (60*24*3)
        -- PROBLEM HERE - there can still be a number of other clicks that occurred before the purchase; I need to filter to only the last one
    GROUP BY
        date
        ,clicks.campaign_id
Snowflake supports lateral joins, that is, joining on to a table function or a correlated sub-query. This lets you join each purchase on to a sub-query that returns at most one row per input row:

    SELECT
        purchases.date
        ,purchases.revenue
        ,clicks.campaign_id
    FROM
        purchases
    LEFT JOIN LATERAL
    (
        SELECT
            campaign_id
            ,user_id
            ,click_time
        FROM
            campaign_clicks
        WHERE
            user_id = purchases.user_id
            -- only select campaign clicks that occurred before the purchase
            AND click_time < purchases.purchase_time
            -- only include clicks that occurred within 3 days of the purchase
            AND click_time >= DATEADD(days, -3, purchases.purchase_time)
        ORDER BY
            click_time DESC
        LIMIT
            1
    )
        AS clicks
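If the lateral form isn't available, a different technique reaches the same "one click per purchase" result: join first with the time-window conditions, then number the surviving clicks per purchase and keep only the newest. The sketch below is only a sanity check of that logic, not Snowflake code. It runs against an in-memory SQLite database from Python (window functions need SQLite 3.25 or later), rewrites the sample timestamps as ISO text, assumes all three purchases happened on 2020/09/01 as the date column says, and assumes (user_id, purchase_time) identifies a purchase. The names `ranked`, `rn` and `last_click_revenue` are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for Snowflake here (an assumption made
# purely so the example is runnable).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (date TEXT, user_id INTEGER, revenue REAL, purchase_time TEXT);
CREATE TABLE campaign_clicks (user_id INTEGER, campaign_id INTEGER, click_time TEXT);
INSERT INTO purchases VALUES
    ('2020-09-01', 10, 30.0, '2020-09-01 10:10:00'),
    ('2020-09-01', 20, 15.0, '2020-09-01 09:15:00'),
    ('2020-09-01', 30, 25.0, '2020-09-01 08:15:00');
INSERT INTO campaign_clicks VALUES
    (10, 2, '2020-09-01 10:01:00'),
    (10, 1, '2020-09-01 10:05:00'),
    (10, 2, '2020-09-01 10:20:00'),
    (20, 2, '2020-09-01 10:10:00'),
    (30, 2, '2020-09-01 07:30:00');
""")

QUERY = """
WITH ranked AS (
    SELECT
        p.date,
        p.revenue,
        c.campaign_id,
        -- rn = 1 marks the newest qualifying click for each purchase
        ROW_NUMBER() OVER (
            PARTITION BY p.user_id, p.purchase_time
            ORDER BY c.click_time DESC
        ) AS rn
    FROM purchases AS p
    JOIN campaign_clicks AS c
      ON c.user_id = p.user_id
     -- only clicks before the purchase ...
     AND c.click_time < p.purchase_time
     -- ... and no more than 3 days before it (SQLite date arithmetic)
     AND c.click_time >= datetime(p.purchase_time, '-3 days')
)
SELECT date, campaign_id, SUM(revenue) AS revenue
FROM ranked
WHERE rn = 1
GROUP BY date, campaign_id
ORDER BY date, campaign_id
"""


def last_click_revenue(connection):
    """Return (date, campaign_id, revenue) rows using last-click attribution."""
    return connection.execute(QUERY).fetchall()


rows = last_click_revenue(conn)
print(rows)  # [('2020-09-01', 1, 30.0), ('2020-09-01', 2, 25.0)]
```

Because the time conditions are part of the join, clicks after the purchase are already gone before ROW_NUMBER() runs, and the inner join drops purchases with no qualifying click (user 20). In Snowflake the window bound would presumably be written with DATEADD as in the query above, and the `rn = 1` filter could be expressed with QUALIFY instead of a CTE.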