For context I have a simple Azure SQL database and I have a theoretical dimensional model with like 12 dimensions and some fact tables that eventually should function as a DWH.
I am looking to apply historization on this dimenional model. To start simple and apply the concept of historization I want to enable it on one dimension. In my case this is the DimEmployee which looks like this with a row of sample data (it has more columns but for simplicity sake I only take these)
EmpKey | EmpName | EmpCity | DWHDatStart | DWHDatEnd | IsActive |
---|---|---|---|---|---|
52 | John | London | 02/02/2020 | 31/12/9999 | Y |
Now if lets say John moves to a different city I want this to be tracked so the wanted situation when the value of EmpCity is updated would be
EmpKey | EmpName | EmpCity | DWHDatStart | DWHDatEnd | IsActive |
---|---|---|---|---|---|
52 | John | Amsterdam | 08/06/2021 | 31/12/9999 | Y |
52 | John | London | 02/02/2020 | 08/06/2021 | N |
How would I go about applying this with TSQL. I assume it will have to be a trigger of some sorts but lack the specific knowledge to apply this. I've read about temporal tables and that usually when enabling historization on data you store the historical data in a separate history table. I think because of this models design there can always be expanded on this concept by adding DimEmployeeHist dimensions or whatever but for the first concept this is not needed. I don't know what value it would bring doing it in separate tables other then when you have lot's of history records so you want them all neatly stored in one place to not clutter up your main tables.
Does this have to be done in the create table script or can this be done after the table is created?
Firstly: forget about triggers. Triggers are bad
This is a braindump on how I generally do this. There are many permutations on this but this should give you an idea.
This is a row in the target dimension. Of course there are many other rows
SurrogateKey | SourceKey | SourceSystem | EmpName | EmpCity | DWHDatStart | DWHDatEnd | IsActive |
---|---|---|---|---|---|---|---|
3678 | 52 | HRSystem1 | John | London | 2020-02-20 | 9999-12-31 | Y |
3642 | 73 | HRSystem1 | Jim | Brisbane | 2021-03-18 | 9999-12-31 | Y |
SurrogateKey
is the key that joins to your fact and is unique in the dimension table (and should be enforced with a constraint or index)SourceKey
is the key in the source systemSourceSystem
is a code for whatever system provided this record.SourceKey
+ SourceSystem
is unique (and should be enforced with a constraint or index)We load data into our datawarehouse. Typically the first step is to load the input data into a staging table. So say we have this in a staging table:
EmpKey | EmpName | EmpCity | ActiveRecordSurrogateKey | Ignore |
---|---|---|---|---|
52 | John | Amsterdam | NULL | NULL |
73 | Jim | Brisbane | NULL | NULL |
7 | Jack | Texas | NULL | NULL |
ActiveRecord
and Ignore
are working columns, they don't come from the source. Every other column does come from the source but knows nothing about the dimension
There will obviously be many records in the dimension and staging table.
First exclude all records that are up to date in the dimension.
UPDATE StagingTable
SET Ignore = 'Y'
FROM StagingTable TGT
INNER JOIN DimensionTable SRC
ON TGT.EmpKey = SRC.SourceKey
AND TGT.SourceKey = 'HRSystem1'
AND TGT.EmpName=SRC.EmpName
AND TGT.EmpCity=SRC.EmpCity
AND SRC.IsActive = 'Y'
We've identified that Jim has not changed and can be ignored
EmpKey | EmpName | EmpCity | ActiveRecordSurrogateKey | Ignore |
---|---|---|---|---|
52 | John | Amsterdam | NULL | NULL |
73 | Jim | Brisbane | NULL | Y |
7 | Jack | Texas | NULL | NULL |
identify all the records in the staging table that already have an active record in the dimension but have different attributes
UPDATE StagingTable
SET ActiveRecordSurrogateKey = SRC.SurrogateKey
FROM StagingTable TGT
INNER JOIN DimensionTable SRC
ON TGT.EmpKey = SRC.SourceKey
AND TGT.SourceKey = 'HRSystem1'
AND TGT.IsActive='Y'
AND (TGT.EmpName<>SRC.EmpName OR TGT.EmpCity<>SRC.EmpCity)
(These two seperate updates could be combined into one if you want.)
Now our staging table looks like this.
EmpKey | EmpName | EmpCity | ActiveRecordSurrogateKey | Ignore |
---|---|---|---|---|
52 | John | Amsterdam | 3678 | NULL |
73 | Jim | Brisbane | NULL | Y |
7 | Jack | Texas | NULL | NULL |
Now we have enough info to update the dimension. We can write some more SQL to apply to the dimension based on the helper columns.
But first, Lets specify a fixed date. This stops weird things happening if this runs before and after midnight. Or you might want to determine this some other way, like an input parameter or data in the source
DECLARE @Date DATE = GETDATE();
Now we can insert all new records (brand new or changed)
-- This line inserts new dimension records:
INSERT INTO DimensionTable (SourceKey,SourceSystem,EmpName,EmpCity, StartDate,EndDate,IsActive)
SELECT EmpKey,'HRSystem1',EmpName,EmpCity, @Date,'2999-01-01','Y'
FROM StagingTable
WHERE Ignore IS NULL
Now our dimension looks like this
SurrogateKey | SourceKey | SourceSystem | EmpName | EmpCity | DWHDatStart | DWHDatEnd | IsActive |
---|---|---|---|---|---|---|---|
3678 | 52 | HRSystem1 | John | London | 2020-02-20 | 9999-12-31 | Y |
3642 | 73 | HRSystem1 | Jim | Brisbane | 2021-03-18 | 9999-12-31 | Y |
3693 | 7 | HRSystem1 | Jack | Texas | 2021-06-09 | 9999-12-31 | Y |
3694 | 52 | HRSystem1 | John | Amsterdam | 2021-06-09 | 9999-12-31 | Y |
Now we end date existing records:
-- This line end-dates existing records:
UPDATE DimensionTable
SET DWEndDate = @Date, Active = 'N'
FROM DimensionTable TGT
INNER JOIN StagingTable SRC
ON TGT.SurrogateKey = ActiveRecordSurrogateKey
SurrogateKey | SourceKey | SourceSystem | EmpName | EmpCity | DWHDatStart | DWHDatEnd | IsActive |
---|---|---|---|---|---|---|---|
3678 | 52 | HRSystem1 | John | London | 2020-02-20 | 2021-06-09 | N |
3642 | 73 | HRSystem1 | Jim | Brisbane | 2021-03-18 | 9999-12-31 | Y |
3693 | 52 | HRSystem1 | Jack | Texas | 2021-06-09 | 9999-12-31 | Y |
3694 | 52 | HRSystem1 | John | Amsterdam | 2021-06-09 | 9999-12-31 | Y |
So you basically wrap all of those T-SQL statements up in one stored procedure, add some transactions, logging, and error handling.
CREATE PROC pUpdateDimPerson
AS
BEGIN
-- All the code above
END
The stored procedure compares source (staging) and target (dimension) and does all the right things.
There are a bunch of other things to consider but this gives you an idea.