I am looking for the fastest way to get the 1st record (columns a,b,c ) for every partition (a,b) using SQL. Table is ~10, 000, 000 rows.
Approach #1:
SELECT * FROM (
SELECT a,b,c,
ROW_NUMBER() OVER ( PARTITION by a, b ORDER BY date DESC) as row_num
FROM T
) WHERE row_num =1
But it probably does extra work behind the scene - I need only 1st row per partition.
Approach #2 using FIRST_VALUE(). Since FIRST_VALUE() returns expression let pack/concatenate a,b,c using some separator into single expression, e.g.:
SELECT FIRST_VALUE(a+','+'b'+','+c)
OVER ( PARTITION by a, b ORDER BY date DESC rows unbounded preceding) FROM T
But in this case I need to unpack the result, which is extra step.
Approach #3 using FIRST_VALUE() - repeat OVER (...) for a , b :
SELECT
FIRST_VALUE(a)
OVER ( PARTITION by a, b ORDER BY date DESC rows unbounded preceding),
FIRST_VALUE(b)
OVER ( PARTITION by a, b ORDER BY date DESC rows unbounded preceding),
c
FROM T
In approach #3 I do not know if database engine (Redshift) smart enough to partition only once
The first query is different from the other two. The first only returns one row per group. The other two return the same rows as in the original query.
You should use the version that does what you want, which I presume is the first one. If you add select distinct
or group by
to the other queries, that will probably add overhead that will make them slower -- but you can test on your data to see if that is true.
Your intuition is correct that the first query does unnecessary work. In databases that support indexes fully, a correlated subquery is often faster. I don't think that would be the case in Redshift, however.