Have a question that is haunting me for some time.
How in practice looks replacing primary keys with surrogate keys during the ETL process? Like what is the workflow - is it just assigning new IDENTITY? If so, how about previous values, how to replace existing business keys with newly created ones?
In my mind a specific workflow looks like below, but I haven't done it in practice yet:
But please tell me how you do this in your job and correct me, because I think I'm wrong.
Let's take the simplest case where your target dimension is being loaded from a single source system. The basic steps would be:
Take the unique identifier for the source system record - normally either the PK or BK
Use this identifier to lookup the corresponding record in the target dimension - which holds this identifier as well as the SK and other attributes - and return the SK if a record is found in the Dim
If an SK is found then you are going to perform an Update on the Dim using the SK as the primary identifier
a. You may also need to perform an insert e.g. if the Dim is SCD2
b. If there have been no changes between the source and target record you may decide not to process the source record
If no SK is found then you will insert a new record into the target Dim, generating a new SK value in one of two main ways:
a. Using the capabilities of the underlying database, such as sequences, auto-increment columns, etc.
b. Using the capabilities of your ETL tool e.g. a sequence generator
These are obviously the logically steps you need to follow. How you actually implement them depends entirely on your ETL/ELT components - so running a merge command in your DB will look very different from an Informatica workflow but "under the covers" both processes are following the same logical steps