sqldata-warehousescd2

Duplicate Rows in Fact Table from SCD2 changes in Dimension


Okay so here are simplified tables to help explain my situation

FactListingCreated:

ListingCreatedSk
CreatedDateSk
ListingSk
StateSk

DimListing

ListingSk
ListingBk
StateCode
ListingPrice (SCD2)
ListingStatus (SCD2)
RowEffectiveDate
RowExpirationDate
RowCurrentIndicator

The problem I am having is that when I merge updates from my dimension to my basic transactional fact table I end up with duplicate entries (same ListingBK) in my Fact as a result of SCD2 changes in the dimension adding a new row. What is the best way to handle these situations, with the key constraint we have is we want each row in the fact to point at the original Sk in the dimension table.

Current Procedure:

MERGE INTO dbo.FactListingCreated AS target
    USING 
        (
            SELECT       dlm.CreatedDateSk,
                         dl.ListingSk,
                         CASE
                             WHEN db.BrokerageSk IS NULL THEN -1
                             ELSE db.BrokerageSk 
                         END as BrokerageSk,
                         ds.StateSk

            FROM         stage.DimListingMerge as dlm 

            LEFT JOIN    dbo.DimDate as dd
            ON           dd.DateSk = dlm.CreatedDateSk

            LEFT JOIN    dbo.DimListing as dl
            ON           dl.ListingBk = dlm.ListingBk
            AND          dl.RowCurrentIndicator = 1

            LEFT JOIN    dbo.DimBrokerage as db
            ON           db.BrokerageBk = dlm.BrokerageBk

            LEFT JOIN    dbo.DimState as ds
            ON           ds.StateCode = dlm.StateCode
        ) source  
    ON (target.ListingSk = source.ListingSk)
    THEN UPDATE SET
        target.CreatedDateSk = source.CreatedDateSk,
        target.BrokerageSk = source.BrokerageSk,
        target.StateSk = source.StateSk
    WHEN NOT MATCHED THEN
      INSERT VALUES
      (
        source.CreatedDateSk,
        source.ListingSk,
        source.StateSk
      );

So I think this procedure would work on updates (where just the previous day's data is pulled), however, would the best method just be to make a separate initial run (pulls all data from dim) where it pulls the initial row for each record? Or am I missing something very obvious that would make this possible with a single stored procedure?


Solution

  • When you load a fact table that references an SCD2 dimension you need to select the dimension record, from the many with the same BK, that was applicable at the time of the fact "event" date - what that event date is for your facts is determined by your business logic, so it might be a created date, an effective date or something else. Let's assume it is a column called EventDate...

    Your SQL JOIN needs to look something like this:

    LEFT JOIN    dbo.DimListing as dl
    ON           dl.ListingBk = dlm.ListingBk
    AND          dlm.EventDate BETWEEN dl.RowEffectiveDate AND dl.RowExpirationDate
    

    At the moment, your SQL is just picking up the current row from the dimension for all your fact records hence, I suspect, the reason you are getting duplicates