Tags: sql, amazon-redshift, scd2

Creating SCD History in Redshift Table with SQL


Problem

I have a starting table that looks like this:

[screenshot: the starting table, with identifier, loaddate, and symbol columns]

I'm attempting to write a SQL query that will load this data into a similarly formatted table, but with additional adornments indicating when records expired and which records are active. The results would look like this:

identifier                       | loaddate   | loadenddate | activeflag | symbol
723a90699e99ec9e00216910910384bd | 2020-04-01 | 2020-04-07  | 0          | DXB
723a90699e99ec9e00216910910384bd | 2020-04-08 | 2999-12-31  | 1          | DXB CL

Note that there are thousands of different identifiers, some with one, two, or three-plus different symbols over various timeframes.

Requirements

  1. Any time an identifier is seen for the first time, it must be created in the final table with today's date as the loaddate, a loadenddate of 2999-12-31, and activeflag = 1
  2. When that identifier is seen on a later day, only add a row if the symbol has changed. If it has, "expire" the previous row by setting its loadenddate to the new row's loaddate minus 1 day and its activeflag to 0
  3. The SQL query (or queries) must be re-runnable against the source table each day going forward, correctly handling both existing data in the destination table and an empty destination table (the initial run)
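To make the requirements concrete, here is a minimal in-memory sketch of the daily apply logic in Python (the table is modeled as a list of dicts; the name `apply_daily_load` and the sample data are illustrative, not part of the question):

```python
from datetime import date, timedelta

HIGH_DATE = date(2999, 12, 31)  # the "never expires" sentinel from the question

def apply_daily_load(final_rows, batch, today):
    """Apply one day's batch to the SCD2 history.
    final_rows: list of dicts with identifier/loaddate/loadenddate/activeflag/symbol.
    batch: dict mapping identifier -> symbol as seen today."""
    active = {r["identifier"]: r for r in final_rows if r["activeflag"] == 1}
    for ident, symbol in batch.items():
        cur = active.get(ident)
        if cur is None:
            # Requirement 1: first sighting -> open a new active row
            final_rows.append({"identifier": ident, "loaddate": today,
                               "loadenddate": HIGH_DATE, "activeflag": 1,
                               "symbol": symbol})
        elif cur["symbol"] != symbol:
            # Requirement 2: symbol changed -> expire the old row, open a new one
            cur["loadenddate"] = today - timedelta(days=1)
            cur["activeflag"] = 0
            final_rows.append({"identifier": ident, "loaddate": today,
                               "loadenddate": HIGH_DATE, "activeflag": 1,
                               "symbol": symbol})
        # unchanged symbol: no-op, so re-running the same day is safe (Requirement 3)
    return final_rows
```

Running this for 2020-04-01 with symbol DXB and then 2020-04-08 with DXB CL reproduces the two-row example above.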

What I've Got So Far

To load initially (and not duplicate), I have the following SQL:

INSERT INTO finaltable (
listinghashkey,
symbol,
loaddate,
loadenddate,
activeflag
)
SELECT
s.listinghashkey,
s.symbol,
MAX(s.loaddate),
'2999-12-31 00:00:00.0',
1
FROM
startingtable s
LEFT JOIN finaltable f ON s.listinghashkey = f.listinghashkey
WHERE f.listinghashkey IS NULL
GROUP BY s.listinghashkey, s.symbol;

Solution

  • Converting your initial format to the new format is pretty easy as a gaps-and-islands problem:

    select identifier, symbol, min(loaddate) as loaddate,
           nullif(max(loaddate), max_loaddate) as loadenddate
    from (select s.*,
                 max(loaddate) over () as max_loaddate,
                 row_number() over (partition by identifier order by loaddate) as seqnum,
                 row_number() over (partition by identifier, symbol order by loaddate) as seqnum_2
          from startingtable s
         ) s
    group by identifier, symbol, (seqnum - seqnum_2);
    

    This eschews the "is active" flag and the arbitrary future date. It just uses NULL for the infinite future. (You can easily tweak the logic for your version; this is just simpler to work with.)
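The seqnum - seqnum_2 trick can be demonstrated outside the database. This Python sketch (sample data and the name `to_islands` are mine, not from the question) computes the same two row numbers per row and groups on their difference, which stays constant within a run of one symbol and jumps when the symbol changes:

```python
from itertools import groupby

def to_islands(rows):
    """Collapse consecutive runs of the same symbol per identifier into
    (identifier, symbol, start, end) ranges, mirroring the SQL's
    seqnum - seqnum_2 grouping. rows: (identifier, loaddate, symbol) tuples."""
    rows = sorted(rows, key=lambda r: (r[0], r[1]))
    max_loaddate = max(r[1] for r in rows)  # max(loaddate) over ()
    seqnum, seqnum_2, keyed = {}, {}, []
    for ident, loaddate, symbol in rows:
        seqnum[ident] = seqnum.get(ident, 0) + 1
        seqnum_2[ident, symbol] = seqnum_2.get((ident, symbol), 0) + 1
        # the difference is constant within a run and changes when the symbol flips
        keyed.append((ident, symbol, seqnum[ident] - seqnum_2[ident, symbol], loaddate))
    out = []
    for (ident, symbol, _), grp in groupby(keyed, key=lambda r: r[:3]):
        dates = [g[3] for g in grp]
        # nullif(max(loaddate), max_loaddate): the still-open range gets None
        end = None if max(dates) == max_loaddate else max(dates)
        out.append((ident, symbol, min(dates), end))
    return out
```

Feeding it three loads of DXB, DXB, DXB CL yields one closed range and one open (None-ended) range, matching the query's output shape.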

    If you have this table and you want to add the next load date, then you can construct the next version in its entirety using union all. The idea is to divide the processing into four steps:

      1. Keep all records that are already closed.
      2. Carry over active records whose symbol is unchanged in the new batch.
      3. Close active records that do not appear in the new batch.
      4. Add records that are new in the batch.

    The SQL looks like:

    -- all completed records
    select ft.identifier, ft.symbol, ft.loaddate, ft.loadenddate
    from finaltable ft
    where loadenddate is not null
    union all
    -- carry over active records where the new data is consistent
    select ft.identifier, ft.symbol, ft.loaddate, ft.loadenddate
    from finaltable ft join
         oneload ol
         on ft.identifier = ol.identifier and
            ft.symbol = ol.symbol
    where ft.loadenddate is null
    union all
    -- close active records that are not in the new batch
    select ft.identifier, ft.symbol, ft.loaddate, m.loaddate - interval '1 day'
    from finaltable ft cross join
         (select max(loaddate) as loaddate
          from oneload
         ) m left join
         oneload ol
         on ft.identifier = ol.identifier and
            ft.symbol = ol.symbol
    where ft.loadenddate is null and
          ol.identifier is null
    union all
    -- finally add new records
    select ol.identifier, ol.symbol, ol.loaddate, null
    from oneload ol left join
         finaltable ft
         on ft.identifier = ol.identifier and
            ft.symbol = ol.symbol and
            ft.loadenddate is null
    where ft.identifier is null;
    

    I prefer to do this as a select/replace operation rather than a series of insert/update steps (or perhaps using merge). But this is the basic idea.
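The select/replace pattern can be sketched as "build the next version of the table in full, then swap it in." Here SQLite (via Python's stdlib `sqlite3`) stands in for Redshift purely for illustration; in Redshift you would likely use CREATE TABLE ... AS followed by a DROP and ALTER TABLE ... RENAME, and the placeholder SELECT below is where the four-part union all query would go:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE finaltable (identifier TEXT, symbol TEXT,
                             loaddate TEXT, loadenddate TEXT);
    INSERT INTO finaltable VALUES ('id1', 'DXB', '2020-04-01', NULL);
""")

# Build the next version of the history in its entirety, then swap it in.
# The inner SELECT is a stand-in for the four-part UNION ALL query above.
con.executescript("""
    CREATE TABLE finaltable_new AS
    SELECT identifier, symbol, loaddate, loadenddate
    FROM finaltable;
    DROP TABLE finaltable;
    ALTER TABLE finaltable_new RENAME TO finaltable;
""")

rows = con.execute(
    "SELECT identifier, symbol, loaddate, loadenddate FROM finaltable"
).fetchall()
```

The appeal of the swap is that the new table is complete and consistent before readers ever see it, whereas interleaved inserts and updates expose a half-finished state.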