Let's say i have a table with Users :
user_id |
---|
abc |
def |
And a table of Purchases :
purchase_id | purchase_date | status | user_id |
---|---|---|---|
1 | 2020-01-01 | sold | abc |
2 | 2020-02-01 | refunded | abc |
3 | 2020-03-01 | sold | def |
4 | 2020-04-01 | sold | def |
5 | 2020-05-01 | sold | def |
I want the status of the last purchase for each user, which would be :
user_id | last_purchase_date | status |
---|---|---|
abc | 2020-02-01 | refunded |
def | 2020-05-01 | sold |
I am wondering what is the best approach (in term of performance, cost, readability ...) between those three queries that give the same results :
Aggregated Function
SELECT
user_id,
MAX(purchase_date) as last_purchase_date,
ARRAY_AGG(status ORDER BY purchase_date DESC LIMIT 1)[SAFE_OFFSET(0)] as last_status
FROM user
LEFT JOIN purchase USING (user_id)
GROUP BY user_id
Analytics Function
SELECT
DISTINCT
user_id,
MAX(purchase_date) OVER (PARTITION BY user_id) as last_purchase_date,
LAST_VALUE(status) OVER (PARTITION BY user_id ORDER BY purchase_date ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as last_status,
FROM user
LEFT JOIN purchase USING (user_id)
Subquery
SELECT
user_id,
purchase_date as last_purchase_date,
status as last_status
FROM user
LEFT JOIN purchase USING (user_id)
WHERE purchase_date IN (
SELECT
MAX(purchase_date) as purchase_date
FROM purchase
GROUP BY user_id
)
Here is the Dataset for those who want it :
WITH purchase as (
SELECT 1 as purchase_id, "2020-01-01" as purchase_date, "sold" as status, "abc" as user_id
UNION ALL SELECT 2 as purchase_id, "2020-02-01" as purchase_date, "refunded" as status, "abc" as user_id
UNION ALL SELECT 3 as purchase_id, "2020-03-01" as purchase_date, "sold" as status, "def" as user_id
UNION ALL SELECT 4 as purchase_id, "2020-04-01" as purchase_date, "sold" as status, "def" as user_id
UNION ALL SELECT 5 as purchase_id, "2020-05-01" as purchase_date, "sold" as status, "def" as user_id
), user as (
SELECT "abc" as user_id,
UNION ALL SELECT "def" as user_id,
)
The answer is pretty simple: always use the most performing and cheapest approach.
When you run your code snippets with the provided test data you can see in the execution details which one is better for this use case:
As you can see the aggregated function consumed around 3 times less slot time and shuffled less bytes than the analytics function. That means for this use case it is much better to use your aggregate function then the analytical one.
The power of analytical functions comes when you have use cases where you cannot use aggregate functions and can only be solved with analytical functions. Analytical functions have a wider range of application than other types of functions.