sqlsql-servercdcscdscd2

Implement SCD Type 2 on periodic snapshot table


Currently I have a very big table that have a snapshot of data for each month.

ID other Team Period
1 ..... A 2020-04-30
1 ..... A 2020-05-31
1 ..... A 2020-06-30
1 ..... A 2020-07-31
1 ..... B 2020-08-31
1 ..... B 2020-09-30
1 ..... C 2020-10-31
1 ..... C 2020-11-30
1 ..... C 2020-12-31

I would like to implement SCD on this table based on comparison of all listed columns. How Can I achieve it?

In this example I would like to have only 3 rows where Team column was changed and proper valid_from, valid_to columns based on period column.


Solution

  • If you can get those who built the monthly snapshot to do otherwise, encourage them strongly to do so.

    For the rest: in one query, get the previous value for team, as prev_team - I do that in a WITH clause, and name the Common table Expression w_prev.

    Then I run an outer query on w_prev, filtering by the previous team being NULL or different from the current team value - and use the three-parameter variant of the LEAD() OLAP function to get the next date, or, if missing, a maximum possible date:

    WITH
    -- your input, don't use in final query ..
    indata(id,other,team,period) AS (
              SELECT 1,'.....','A',CAST('2020-04-30' AS DATE)
    UNION ALL SELECT 1,'.....','A',CAST('2020-05-31' AS DATE)
    UNION ALL SELECT 1,'.....','A',CAST('2020-06-30' AS DATE)
    UNION ALL SELECT 1,'.....','A',CAST('2020-07-31' AS DATE)
    UNION ALL SELECT 1,'.....','B',CAST('2020-08-31' AS DATE)
    UNION ALL SELECT 1,'.....','B',CAST('2020-09-30' AS DATE)
    UNION ALL SELECT 1,'.....','C',CAST('2020-10-31' AS DATE)
    UNION ALL SELECT 1,'.....','C',CAST('2020-11-30' AS DATE)
    UNION ALL SELECT 1,'.....','C',CAST('2020-12-31' AS DATE)
    )
    -- end of input, replace following comma with "WITH" ..                                                                                                                                                       
    ,
    w_prev AS (
      SELECT
        *
      , LAG(team) OVER(PARTITION BY id ORDER BY period) AS prev_team
      FROM indata
    )
    SELECT
      id
    , other
    , team
    , period
    , LEAD(period,1,CAST('9999-12-01' AS DATE)) OVER(PARTITION BY id ORDER BY period) AS to_period
    FROM w_prev
    WHERE prev_team IS NULL
       OR prev_team <> team
    ;
    
    id other team period to_period
    1 ..... A 2020-04-30 2020-08-31
    1 ..... B 2020-08-31 2020-10-31
    1 ..... C 2020-10-31 9999-12-01