What is the best practice to deal with schema-changing when building a Slow Changing Dimension table?
For example, a column was added:
First state:
+----------+---------------------+-------------------+
|customerId|address |updated_at |
+----------+---------------------+-------------------+
|1 |current address for 1|2018-02-01 00:00:00|
+----------+---------------------+-------------------+
New state with new column, but every other followed column constant:
+----------+---------------------+-------------------+------+
|customerId|address |updated_at |newCol|
+----------+---------------------+-------------------+------+
|1 |current address for 1|2018-03-03 00:00:00|1000 |
+----------+---------------------+-------------------+------+
My first approach is to think that schema-changing means the row has changed. So I would add a new row to my SCD table:
+----------+---------------------+-------------------+------+-------------+-------------------+-------------------+
|customerId|address |updated_at |newCol|active_status|active_status_start|active_status_end |
+----------+---------------------+-------------------+------+-------------+-------------------+-------------------+
|1 |current address for 1|2018-02-01 00:00:00|null |false |2018-02-01 00:00:00|2018-03-03 00:00:00|
|1 |current address for 1|2018-03-03 00:00:00|1000 |true |2018-03-03 00:00:00|null |
+----------+---------------------+-------------------+------+-------------+-------------------+-------------------+
But, what if the columns were added, but for some specific row the value is null? For example, for row with customerId = 2, it is null:
+----------+---------------------+-------------------+------+
|customerId|address |updated_at |newCol|
+----------+---------------------+-------------------+------+
|2 |current address for 2|2018-03-03 00:00:00|null |
+----------+---------------------+-------------------+------+
In this case, I can take two approaches:
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|customerId|address |updated_at |active_status|active_status_end |active_status_start|newCol|
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|1 |current address for 1|2018-02-01 00:00:00|false |2018-03-03 00:00:00|2018-02-01 00:00:00|null |
|1 |current address for 1|2018-03-03 00:00:00|true |null |2018-03-03 00:00:00|1000 |
|2 |current address for 2|2018-02-01 00:00:00|false |2018-03-03 00:00:00|2018-02-01 00:00:00|null |
|2 |current address for 2|2018-03-03 00:00:00|true |null |2018-03-03 00:00:00|null |
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|customerId|address |updated_at |active_status|active_status_end |active_status_start|newCol|
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|1 |current address for 1|2018-02-01 00:00:00|false |2018-03-03 00:00:00|2018-02-01 00:00:00|null |
|1 |current address for 1|2018-03-03 00:00:00|true |null |2018-03-03 00:00:00|1000 |
|2 |current address for 2|2018-02-01 00:00:00|true |null |2018-02-01 00:00:00|null |
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
The second approuch seems more "correct", but am I right? Also, implement approuch 1 is much simpler. Approuch 2 would need some more complicated and has other trade-offs, for example: a) What if instead of adding a columns, a columnd was droped? b) From a query persperctive it is much more costlier.
I have done research on the subject and didn't fount this kind of situation being treated.
What is the standard approach to it? Trade-offs? Is there another approach I am missing here?
Thank you all.
Thanks for @MarmiteBomber and @MatBailie comments. Based on your comments I ended up implementing the second option, because (summary of your thoughts):
The second approach also needed to add a check (is the new column present in the row?) in write time, but it saves complexity in query time, and storage. Since SCD is "slow" and this case is rare (schema changes happen, but not "every day"), adding the check in write time is better than in query time.