sqlsql-server-2012scd2

Use SQL to remove duplicates from a type 2 slowly changing dimension


I'm building a data warehouse with multiple layers storing the same data. All of the data in one of the middle layers is versioned with start and end dates, as if it was a type 2 slowly changing dimension. The problem occurs when I query these tables. There are typically more columns in the table than in the query, so adjacent versions in the query have different start and end dates, but are otherwise identical. I want to combine these versions, to show dates when columns in the query change, not when rows in the table change.

I have some SQL that almost works:

create table versions 
(id int
, name varchar(100) Not null
, RowStartDate datetime Not null
, RowEndDate datetime Not null
, primary key (id,RowStartDate)
, check (RowStartDate < RowEndDate));

insert into versions values 
 (1,'A','2014-01-01','9999-12-31')
,(2,'B','2014-01-01','2014-12-31')
,(2,'B','2014-12-31','9999-12-31')
,(3,'C','2014-01-01','2014-12-31')
,(3,'CC','2014-12-31','2015-12-31')
,(3,'CC','2015-12-31','9999-12-31')
,(4,'D','2014-01-01','2014-12-31')
,(4,'DD','2014-12-31','2015-12-31')
,(4,'DD','2015-12-31','2016-12-31')
,(4,'D','2016-12-31','9999-12-31')
,(5,'E','2014-01-01','2014-12-31')
,(5,'E','2014-12-31','2015-12-31')
,(5,'E','2015-12-31','2016-12-31')
,(5,'E','2016-12-31','2017-12-31')
,(5,'E','2017-12-31','9999-12-31')
;

WITH CTE_detect_duplicates AS (SELECT [id]
      ,[name]
      ,[RowStartDate]
      ,[RowEndDate]
      ,LAST_VALUE(RowEndDate) OVER (PARTITION BY id, name ORDER BY RowStartDate, RowEndDate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as LastEndDate
      ,rank() OVER (PARTITION BY id, name ORDER BY RowStartDate, RowEndDate) as duplicateNumber
  FROM versions
 )
SELECT [id]
      ,[name]
      ,[RowStartDate]
      ,LastEndDate as RowEndDate
FROM CTE_detect_duplicates
WHERE duplicateNumber = 1

The problem here is that it returns two rows for id "4", when three are desired. Actual:

id  name    RowStartDate    RowEndDate
4   D   2014-01-01 00:00:00.000 9999-12-31 00:00:00.000
4   DD  2014-12-31 00:00:00.000 2016-12-31 00:00:00.000
Desired:
id  name    RowStartDate    RowEndDate
4   D   2014-01-01 00:00:00.000 2014-12-31 00:00:00.000
4   DD  2014-12-31 00:00:00.000 2016-12-31 00:00:00.000
4   D   2016-12-31 00:00:00.000 9999-12-31 00:00:00.000
The value D is not correct for the period when the value DD is correct, so the version dates are incorrect on the first row (4,'D') from the my query.

I want to be able to remove these duplicates in pure SQL, or an in-line table valued function (I have a generator that creates multi-statement table valued functions that do this, but the resulting functions perform poorly). Does anyone have any ideas?


Solution

  • The following query, containing multiple CTE's compresses the date ranges of the updates and removes duplicate values.

    1 First ranks are assigned within each id group, based on the RowStartDate.

    2 Next, the maximum rank (next_rank_no) of the range of ranks which has the same value for NAME is determined. Thus, for the example data, row 1 of id=5 would have next_rank_no=5 and row 2 of id=4 would have next_rank_no=3. This version only handles the NAME column. If you want to handle additional columns, they must be included in the condition as well. For example, if you want to include a LOCATION column, then the join conditions would read as:

      left join sorted_versions sv2 on sv2.id = sv1.id and sv2.rank_no > sv1.rank_no and sv2.name = sv1.name and sv2.location = sv1.location
      left join sorted_versions sv3 on sv3.id = sv1.id and sv3.rank_no > sv1.rank_no and (sv3.name <> sv1.name or sv3.location <> sv1.location)
    

    3 Finally, the first row for each id is selected. Then, the row corresponding to the next_rank_no is selected in a recursive fashion.

    with sorted_versions as --ranks are assigned within each id group
    (
      select 
        v1.id,
        v1.name,
        v1.RowStartDate,
        v1.RowEndDate,
        rank() over (partition by v1.id order by v1.RowStartDate) rank_no
      from versions v1
      left join versions v2 on (v1.id = v2.id and v2.RowStartDate = v1.RowEndDate)
    ),
    next_rank as --the maximum rank of the range of ranks which has the same value for NAME
    (
      select 
      sv1.id id, sv1.rank_no rank_no, COALESCE(min(sv3.rank_no)-1 , COALESCE(max(sv2.rank_no), sv1.rank_no)) next_rank_no
      from sorted_versions sv1
      left join sorted_versions sv2 on sv2.id = sv1.id and sv2.rank_no > sv1.rank_no and sv2.name = sv1.name
      left join sorted_versions sv3 on sv3.id = sv1.id and sv3.rank_no > sv1.rank_no and sv3.name <> sv1.name
      group by sv1.id, sv1.rank_no
    ),
    versions_cte as --the rowenddate of the "maximum rank" is selected 
    (
      select sv.id, sv.name, sv.rowstartdate, sv3.rowenddate, nr.next_rank_no rank_no
      from sorted_versions sv
      inner join next_rank nr on sv.id = nr.id and sv.rank_no = nr.rank_no and sv.rank_no = 1
      inner join sorted_versions sv3 on nr.id = sv3.id and nr.next_rank_no = sv3.rank_no  
      union all
      select
        sv2.id,
        sv2.name, 
        sv2.rowstartdate,
        sv3.rowenddate,
        nr.next_rank_no
      from versions_cte vc
      inner join sorted_versions sv2 on sv2.id = vc.id and sv2.rank_no = vc.rank_no + 1
      inner join next_rank nr on sv2.id = nr.id and sv2.rank_no = nr.rank_no  
      inner join sorted_versions sv3 on nr.id = sv3.id and nr.next_rank_no = sv3.rank_no
    )
    select id, name, rowstartdate, rowenddate
    from versions_cte
    order by id, rowstartdate;
    

    SQL Fiddle demo