etldata-warehousescd

Design a dimension with multiple data sources


I am designing a few dimensions with multiple data sources and wonder what other people have done to align the multiple business keys per data source.

My Example: I have 2 data sources - the Ordering System and the Execution System. The Ordering system has details about payment and what should happen; the Execution System has details on what actually happened (how long it took etc, who executed on the order). Data from both systems is need to created a single fact.

In both the Ordering and Execution system they is a Location table. The business keys from both systems are mapped via an esb . There are attributes in both systems that make up the complete picture about a single location. Billing information is in the Ordering system, latitude and longitude are in the Execution system. And Location Name exists in both systems.

How do you design a SCD accomodate changes from both systems to the dimension?

We follow a fairly strict Kimball methodology - fyi, but I am open to looking at everyone's solutions.


Solution

  • Not necessarily an answer but here are my thoughts:

    You've already covered the real options in your comment. Either:

    A. Merge it beforehand

    You need some merge functionality in staging which matches the two (or more) records, creates a new common merge key and uses that in the dimension. This requires some form of lookup or reference to be stored in addition to normal DW data

    OR

    B. Merge it in the dimension

    Put both records in the dimension and allow the reporting tool to 'merge' it by, for example, grouping by location name. This means you don't need prior merging logic you just dump it in the dimension

    However you have two constraints that I feel makes the choice between A & B clearer

    Firstly, you need an SCD (Type 2 I assume). This means Option B could get very complicated as when there is a change in one source record you have to go find the the other record and change it as well - very unpleasant for option B. You still need some kind of pre-stored key to link them, which means option B is no longer simple

    Secondly, given that you have two sources for one attribute (Location Name), you need some kind of staging logic to pick a single name when these don't match

    So given these two circumstances, I suggest that option A would be best - build some pre-merging logic, as the complexity of your requirements warrants it.

    You'd think this would be a common issue but I've never found a good online reference explaining how someone solved this before.