Tags: sql, apache-spark, data-modeling, data-warehouse, dimensional-modeling

Data Modeling - Slowly Changing Dimension Type 2: how to deal with a schema change (column added)?


What is the best practice for dealing with schema changes when building a Slowly Changing Dimension (SCD) Type 2 table?

For example, a column was added:

First state:
+----------+---------------------+-------------------+
|customerId|address              |updated_at         |
+----------+---------------------+-------------------+
|1         |current address for 1|2018-02-01 00:00:00|
+----------+---------------------+-------------------+


New state, with the new column added but every other column unchanged:
+----------+---------------------+-------------------+------+
|customerId|address              |updated_at         |newCol|
+----------+---------------------+-------------------+------+
|1         |current address for 1|2018-03-03 00:00:00|1000  |
+----------+---------------------+-------------------+------+
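
(As an aside: if the snapshots live in Parquet, Spark can reconcile the two schemas at read time, so rows from the old files simply surface with newCol = null. The path below is hypothetical.)

# "mergeSchema" merges the per-file Parquet schemas; spark is assumed
# to be an active SparkSession, and the path is hypothetical.
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("/data/customer_snapshots"))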

My first instinct is to treat a schema change as a row change, so I would add a new row to my SCD table:

+----------+---------------------+-------------------+------+-------------+-------------------+-------------------+
|customerId|address              |updated_at         |newCol|active_status|active_status_start|active_status_end  |
+----------+---------------------+-------------------+------+-------------+-------------------+-------------------+
|1         |current address for 1|2018-02-01 00:00:00|null  |false        |2018-02-01 00:00:00|2018-03-03 00:00:00|
|1         |current address for 1|2018-03-03 00:00:00|1000  |true         |2018-03-03 00:00:00|null               |
+----------+---------------------+-------------------+------+-------------+-------------------+-------------------+
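
For reference, a minimal PySpark sketch of that update (the DataFrame names incoming and scd are assumptions, and scd is assumed to already carry the merged schema; change detection is discussed below):

from pyspark.sql import functions as F

# Pair each incoming snapshot row with the currently active version.
joined = (incoming.alias("new")
          .join(scd.filter(F.col("active_status")).alias("cur"),
                F.col("new.customerId") == F.col("cur.customerId")))

# Close the old version: its validity ends when the new row arrives.
expired = joined.select(
    F.col("cur.customerId"), F.col("cur.address"), F.col("cur.updated_at"),
    F.col("cur.newCol"),
    F.lit(False).alias("active_status"),
    F.col("cur.active_status_start"),
    F.col("new.updated_at").alias("active_status_end"))

# Open the new version.
fresh = joined.select(
    F.col("new.customerId"), F.col("new.address"), F.col("new.updated_at"),
    F.col("new.newCol"),
    F.lit(True).alias("active_status"),
    F.col("new.updated_at").alias("active_status_start"),
    F.lit(None).cast("timestamp").alias("active_status_end"))

# History rows stay as they are; the active row is replaced by the pair.
scd_updated = (scd.filter(~F.col("active_status"))
               .unionByName(expired)
               .unionByName(fresh))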

But what if the column was added and, for some specific row, the value is null? For example, for the row with customerId = 2:

+----------+---------------------+-------------------+------+
|customerId|address              |updated_at         |newCol|
+----------+---------------------+-------------------+------+
|2         |current address for 2|2018-03-03 00:00:00|null  |
+----------+---------------------+-------------------+------+

In this case, I can take two approaches:

  1. Consider every schema change a row change, even for rows where the new column is null (much easier to implement, but costlier from a storage perspective). It would result in:
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|customerId|address              |updated_at         |active_status|active_status_end  |active_status_start|newCol|
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|1         |current address for 1|2018-02-01 00:00:00|false        |2018-03-03 00:00:00|2018-02-01 00:00:00|null  |
|1         |current address for 1|2018-03-03 00:00:00|true         |null               |2018-03-03 00:00:00|1000  |
|2         |current address for 2|2018-02-01 00:00:00|false        |2018-03-03 00:00:00|2018-02-01 00:00:00|null  |
|2         |current address for 2|2018-03-03 00:00:00|true         |null               |2018-03-03 00:00:00|null  |
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+

  2. Check every row: if it has an actual value for the new column, treat it as changed; otherwise, leave the row alone (I haven't come up with an implementation for it yet, and it is more complicated and likely to be error-prone; a sketch contrasting the two checks follows this list). The result in the SCD table for customerId = 2 would be 'row has not changed':
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|customerId|address              |updated_at         |active_status|active_status_end  |active_status_start|newCol|
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|1         |current address for 1|2018-02-01 00:00:00|false        |2018-03-03 00:00:00|2018-02-01 00:00:00|null  |
|1         |current address for 1|2018-03-03 00:00:00|true         |null               |2018-03-03 00:00:00|1000  |
|2         |current address for 2|2018-02-01 00:00:00|true         |null               |2018-02-01 00:00:00|null  |
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
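
To make the difference between the two checks concrete, here is one way they could look in PySpark, reusing the join from the sketch above (a sketch, not a full implementation; approach 2 hinges on a null-safe comparison):

from pyspark.sql import functions as F

# Business columns to compare, including the added one; "new"/"cur" are
# the aliases from the join in the earlier sketch.
tracked = ["address", "newCol"]

# Approach 1: the schema changed, so every joined row counts as changed.
changed_v1 = F.lit(True)

# Approach 2: null-safe, column-by-column comparison. eqNullSafe treats
# null == null as equal, so customerId = 2 (newCol null on both sides
# after schema alignment) comes out as "row has not changed".
changed_v2 = F.lit(False)
for c in tracked:
    changed_v2 = changed_v2 | ~F.col("new." + c).eqNullSafe(F.col("cur." + c))

rows_to_version = joined.filter(changed_v2)  # only these get a new SCD row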


The second approach seems more "correct", but am I right? Also, approach 1 is much simpler to implement. Approach 2 needs a more complicated implementation and has other trade-offs, for example: a) what if, instead of a column being added, a column was dropped? b) from a query perspective it is much costlier.

I have researched the subject and haven't found this kind of situation being treated.

What is the standard approach here? What are the trade-offs? Is there another approach I am missing?

Thank you all.


Solution

  • Thanks to @MarmiteBomber and @MatBailie for the comments. Based on them, I ended up implementing the second option, because (summarizing your thoughts):

    1. The second approach is the only meaningful one.
    2. Implementation is a consequence of business logic, not necessarily of a standard practice. In our case, we didn't need to differentiate between kinds of null, so the right approach was to represent known non-existent values as null, just like unknown values, etc.
    3. Be explicit.

    The second approach also required adding a check at write time (does the row carry an actual value for the new column?), but it saves complexity at query time, and storage. Since an SCD is "slow" and this case is rare (schema changes happen, but not every day), paying for the check at write time is better than at query time.
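
For completeness, a sketch of that write-time check (align_to is a hypothetical helper; it lines the incoming snapshot up with the dimension's merged schema so the null-safe comparison can run):

from pyspark.sql import functions as F

def align_to(df, target_schema):
    # Add every column the target has but df lacks, as a typed null.
    # Null here means "unknown or not present", the semantics we chose;
    # a dropped column is handled the same way: its values simply
    # become null from this load onward.
    for field in target_schema.fields:
        if field.name not in df.columns:
            df = df.withColumn(field.name, F.lit(None).cast(field.dataType))
    return df

incoming_aligned = align_to(incoming, scd.schema)
# ...then apply the approach-2 comparison (changed_v2 above) and append
# the expired/fresh pairs as in the first sketch.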