data-warehousedimensionalsurrogate-keyrolap

Surrogate Keys in a Data Warehouse


I understand the general concept of a surrogate key in a DWH environment. But there are two aspects I don't understand and couldn't find information about:

  1. Is it common practice that a surrogate key is unique in the whole DWH or unique in one Dimension?
  2. If I have a Dimension with a hierarchy, does that hierarchy influence the generation of the surrogate key?

Solution

  • 1) A surrogate key is unique to one row - it is used as a common handle for the relationships betweeen all the cells in a row. Due to how data is stored, a surrogate key is not strictly nescessary to infer the releationship between cells in a row. But if your row represents a countable identity (a row) in an entity ( a table), which would be the case if your database is normalized, then refering to one single surrogate key (usually the primary key) is easier than keeping a reference to all the participants in the primary key. Maintaing an index on one compact column is easier than on the whole row, for example.

    In fact tables, surrogate keys have another application. Because data is often combined from many sources, chances are that you will run into the problem with composite primary keys (several columns combined are used identify each row uniquely), as well as the problem of duplicate business keys (the keys taken from the various source systems). Because surrogate keys are used for lookups, the compactness of it is important. Use an incrementing integer or a fixed length hash, and keep the business key from the source in a separte column.

    2) This questions is difficult to answer because I dont know what software you are using to manage your dimensions and hierarchies. This influences things a lot. On a typical denormalized Kimball architecture, in a dimension table, a surrogate key is used to reference a unique row in the dimension table. In a dimesion table with several hierarhies, the meaning of this can be a bit confusing. The surrogate key will only be truly unique for the hierarchy with the highest cardinality (number of members), as it is this which will determine how many rows will be in the dimension table. So the practise is that the key is unique to the dimension table, AND exactly ONE of the hierarchies in it - the one with the highest number of members. If you add versioning of hierarchies (slowly changing dimensions) to this, the exact meaning of the surrogate key can be deceptive.

    Note/Rant : I generally find the idea of multiple hierarchies in one dimension table quite apalling. True, it reduces the number of dimension references in the fact table, but there are drawbacks. There are several consequences to the denormalization of the dimension table, (the ugly duplication). One of them is the risk of double counting when joining on a dimension table. This is often fixed (or glossed over) by the software packages used, checking if values are the same and then summing them and reducing the count if the are. But this is a common source of counting anomalies and summing errors which can only be handled down the road by really dirty hacks. Of which I have see quite a few.