I have a table in Redshift which I can access by SQL, and this table gets rows regularly added to it. The updates are both on a monthly interval (major updates) and on a shorter interval (minor updates), kept track of by the id
column. Suppose this table has three columns, id
, val1
and val2
. The id
column is something like NNaaMM
, where:
NN
is two digits counting the major updates, starting with 00
, 01
,...aa
is two characters indicating the group (which for this question may be considered fixed)MM
is two digits counting the minor updates, starting with 00
, 01
,...The columns val1
and val2
have the information I want to compare. I would like the two most recent large updates, using each of their respective latest minor updates. That is, if the latest large update has id=54aa12
and the large update before that has id=53aa02
, then I would like to get a table that looks like the following:
id | val1 | val2 | which |
---|---|---|---|
54aa12 | foo | 6 | current |
54aa12 | bar | 5 | current |
53aa02 | foo | 10 | previous |
53aa02 | baz | 12 | previous |
So far I have an extremely long and slow query that seems to do the job:
WITH prefixes AS (
SELECT SUBSTRING(id,1,3) AS prefix
FROM table
WHERE id LIKE '%aa%'
GROUP BY prefix
ORDER BY prefix DESC
LIMIT 2),
id_max AS (
SELECT t.id AS id
FROM prefixes
JOIN table t
ON t.id LIKE (SELECT MAX(prefix) FROM prefixes)+'%'
GROUP BY id
ORDER BY id DESC
LIMIT 1),
id_min AS (
SELECT t.id AS id
FROM prefixes
JOIN table t
ON t.id LIKE (SELECT MIN(prefix) FROM prefixes)+'%'
GROUP BY id
ORDER BY id DESC
LIMIT 1)
(SELECT *, 'current' AS which
FROM table
WHERE id = (SELECT MAX(id) FROM id_max))
UNION ALL
(SELECT *, 'previous' AS which
FROM table
WHERE id= (SELECT MAX(id) FROM id_min))
My question is: Is there a better / faster / cleaner way to do this?
I'm aware everything is working (and so why am I posting here...), but right now it feels like a hack that only I can understand, and this code is part of a bigger project that I have to share with others. So code readability (and shortness if possible) is important, I'm hoping to have something like a single SELECT .. WHERE id=CURRENT(..) OR id=PREVIOUS(..)
, and I know these functions don't exist, just that format would be preferrable.
I suspect there are more cases than your sample data covers, but row_number() can be used here.
Rationale - Partition by all the records which have minor updates, of those row_number() ordered by the first three characters will give you 1 & 2 for your latest two and >2 for everything else.
CREATE TABLE your_table_name (
id VARCHAR(50),
val1 VARCHAR(50),
val2 INT,
which VARCHAR(50)
);
INSERT INTO your_table_name (id, val1, val2, which) VALUES ('54aa12', 'foo', 6, 'current');
INSERT INTO your_table_name (id, val1, val2, which) VALUES ('54aa12', 'bar', 5, 'current');
INSERT INTO your_table_name (id, val1, val2, which) VALUES ('53aa02', 'foo', 10, 'previous');
INSERT INTO your_table_name (id, val1, val2, which) VALUES ('53aa02', 'baz', 12, 'previous');
select *,
case when
row_number() over (
partition by id like '%aa%'
order by substr(id, 1, 3) desc
) <= 2
then 'current'
else 'previous'
end as derived_which
from your_table_name;
id | val1 | val2 | which | derived_which |
---|---|---|---|---|
54aa12 | foo | 6 | current | current |
54aa12 | bar | 5 | current | current |
53aa02 | foo | 10 | previous | previous |
53aa02 | baz | 12 | previous | previous |