Okay so here are simplified tables to help explain my situation
FactListingCreated:
ListingCreatedSk
CreatedDateSk
ListingSk
StateSk
DimListing
ListingSk
ListingBk
StateCode
ListingPrice (SCD2)
ListingStatus (SCD2)
RowEffectiveDate
RowExpirationDate
RowCurrentIndicator
The problem I am having is that when I merge updates from my dimension to my basic transactional fact table I end up with duplicate entries (same ListingBK) in my Fact as a result of SCD2 changes in the dimension adding a new row. What is the best way to handle these situations, with the key constraint we have is we want each row in the fact to point at the original Sk in the dimension table.
Current Procedure:
MERGE INTO dbo.FactListingCreated AS target
USING
(
SELECT dlm.CreatedDateSk,
dl.ListingSk,
CASE
WHEN db.BrokerageSk IS NULL THEN -1
ELSE db.BrokerageSk
END as BrokerageSk,
ds.StateSk
FROM stage.DimListingMerge as dlm
LEFT JOIN dbo.DimDate as dd
ON dd.DateSk = dlm.CreatedDateSk
LEFT JOIN dbo.DimListing as dl
ON dl.ListingBk = dlm.ListingBk
AND dl.RowCurrentIndicator = 1
LEFT JOIN dbo.DimBrokerage as db
ON db.BrokerageBk = dlm.BrokerageBk
LEFT JOIN dbo.DimState as ds
ON ds.StateCode = dlm.StateCode
) source
ON (target.ListingSk = source.ListingSk)
THEN UPDATE SET
target.CreatedDateSk = source.CreatedDateSk,
target.BrokerageSk = source.BrokerageSk,
target.StateSk = source.StateSk
WHEN NOT MATCHED THEN
INSERT VALUES
(
source.CreatedDateSk,
source.ListingSk,
source.StateSk
);
So I think this procedure would work on updates (where just the previous day's data is pulled), however, would the best method just be to make a separate initial run (pulls all data from dim) where it pulls the initial row for each record? Or am I missing something very obvious that would make this possible with a single stored procedure?
When you load a fact table that references an SCD2 dimension you need to select the dimension record, from the many with the same BK, that was applicable at the time of the fact "event" date - what that event date is for your facts is determined by your business logic, so it might be a created date, an effective date or something else. Let's assume it is a column called EventDate...
Your SQL JOIN needs to look something like this:
LEFT JOIN dbo.DimListing as dl
ON dl.ListingBk = dlm.ListingBk
AND dlm.EventDate BETWEEN dl.RowEffectiveDate AND dl.RowExpirationDate
At the moment, your SQL is just picking up the current row from the dimension for all your fact records hence, I suspect, the reason you are getting duplicates