data-warehousestar-schema

How to get the precise surrogate key for the incomming fact records


I am building a data warehouse for my company. Recently, I just realized that there are some holes (potentially very dangerous) in my SCD type 2 dimension implementation, so that I have to review it over.

The current "fromdate" of a SCD type 2 dimension table is the date it came to the data warehouse, or the date that it replaced an older record, and the "todate" is usually null, or the date a new record with a same natural key came in replacing the old record.

Currently, when loading fact, I get the surrogate key for that fact by using the natural key and condition iscurrent = true or todate = null.

I just realize that this doesn't guarantee the correctness of the surrogate key for fact, for example:

  1. What if the change happened at 11:00AM. This means: half of the transaction occured during that day will be related to the old dimension record, but half of them will related to the new dimension record. But when the data comes to the data warehouse, all the transactions of that day will be treated to be related to the new dimension, and this is not correct.

  2. If we use datetime of the transactions to get the surrogate key more precisely, when loading fact records to the data warehouse, all the transaction that occured before the day the dimension comes to the Data Warehouse will not be able to find any dimension surrogate key related to it. For example: I made the dimension table yesterday, so all the start-date in that SCD 2 dimension table will have a min value of yesterday, while nearly all the old transactions (which haven't been loaded to the data warehouse) happened before that day. So they will have no surrogate key. Such paradox.

  3. I even try to make it more precise by consolidate the start-date of a row, by trying to pass the create date of that dimension row in the OLTP system. But still I can not find the most correct way to do it. First the datetime in Data Warehouse and the OLTP system is different (because they might belong to different GMT+X)...

  4. And many other problems .....

I understand that if we want to track the history in a perfectly precise accuracy, the only way is that we must implement it in the OLTP system by directly writing the related entity to the transaction records. Data warehouse can not do it. But I still feel that there are too many holes in the SCD 2 concept, or that I didn't implement the SCD Typ2 2 system correctly. So please teach me if the above problems is normal, or point the mistake in my understanding out for me.


Solution

    1. If time matters, use a datetime not a date. But first consider whether time matters

    2. Again solved by use of datetime

    3. Decide what timezone your datawarehouse is in.

      • UTC
      • Source System
      • Local system
      • A timezone aware data type

    Just a note: I suggest you use 2099-01-01 rather than NULL as your end date for the current record. Then you can easily use between when searching for a matching dimension member.

    1. You'll need to be more specific

    Edit:

    One observation based on the comments so far: Don't use Is_Current to look up the surrogate key, use the business key in the transaction and the transaction datetime between the dimension start and end date.

    This means you can reload data from three months ago and it will pick up the correct dimension member (not the current one)

    This reinforces my other comment to not use NULL for the active record end date. Instead use a datetime way into the future. so you can always between these dates and get a result