sqlamazon-redshiftself-updating

Get the two latest major updates (and their latest minor updates) in SQL


I have a table in Redshift which I can access by SQL, and this table gets rows regularly added to it. The updates are both on a monthly interval (major updates) and on a shorter interval (minor updates), kept track of by the id column. Suppose this table has three columns, id, val1 and val2. The id column is something like NNaaMM, where:

The columns val1 and val2 have the information I want to compare. I would like the two most recent large updates, using each of their respective latest minor updates. That is, if the latest large update has id=54aa12 and the large update before that has id=53aa02, then I would like to get a table that looks like the following:

id val1 val2 which
54aa12 foo 6 current
54aa12 bar 5 current
53aa02 foo 10 previous
53aa02 baz 12 previous

So far I have an extremely long and slow query that seems to do the job:

WITH prefixes AS (
    SELECT SUBSTRING(id,1,3) AS prefix  
    FROM table
    WHERE id LIKE '%aa%'
    GROUP BY prefix
    ORDER BY prefix DESC
    LIMIT 2),

id_max AS (
    SELECT t.id AS id
    FROM prefixes 
    JOIN table t
    ON t.id LIKE (SELECT MAX(prefix) FROM prefixes)+'%'
    GROUP BY id
    ORDER BY id DESC
    LIMIT 1),

id_min AS (
    SELECT t.id AS id
    FROM prefixes 
    JOIN table t
    ON t.id LIKE (SELECT MIN(prefix) FROM prefixes)+'%'
    GROUP BY id
    ORDER BY id DESC
    LIMIT 1)

(SELECT *, 'current' AS which
FROM table
WHERE id = (SELECT MAX(id) FROM id_max))

UNION ALL

(SELECT *, 'previous' AS which
FROM table
WHERE id= (SELECT MAX(id) FROM id_min))

My question is: Is there a better / faster / cleaner way to do this?

I'm aware everything is working (and so why am I posting here...), but right now it feels like a hack that only I can understand, and this code is part of a bigger project that I have to share with others. So code readability (and shortness if possible) is important, I'm hoping to have something like a single SELECT .. WHERE id=CURRENT(..) OR id=PREVIOUS(..), and I know these functions don't exist, just that format would be preferrable.


Solution

  • I suspect there are more cases than your sample data covers, but row_number() can be used here.

    Rationale - Partition by all the records which have minor updates, of those row_number() ordered by the first three characters will give you 1 & 2 for your latest two and >2 for everything else.

    CREATE TABLE your_table_name (
        id VARCHAR(50),
        val1 VARCHAR(50),
        val2 INT,
        which VARCHAR(50)
    );
    
    INSERT INTO your_table_name (id, val1, val2, which) VALUES ('54aa12', 'foo', 6, 'current');
    INSERT INTO your_table_name (id, val1, val2, which) VALUES ('54aa12', 'bar', 5, 'current');
    INSERT INTO your_table_name (id, val1, val2, which) VALUES ('53aa02', 'foo', 10, 'previous');
    INSERT INTO your_table_name (id, val1, val2, which) VALUES ('53aa02', 'baz', 12, 'previous');
    
    select *,
           case when 
               row_number() over ( 
                 partition by id like '%aa%'
                     order by substr(id, 1, 3) desc
                 ) <= 2 
               then 'current' 
               else 'previous'
           end as derived_which
      from your_table_name;
    
    id val1 val2 which derived_which
    54aa12 foo 6 current current
    54aa12 bar 5 current current
    53aa02 foo 10 previous previous
    53aa02 baz 12 previous previous