sqlsql-serverdata-warehousedimensionstype-2-dimension

Type II dimension joins


I have the following table lookup table in OLTP

CREATE TABLE TransactionState
(
    TransactionStateId INT IDENTITY (1, 1) NOT NULL,
    TransactionStateName VarChar (100)
)

When this comes into my OLAP, I change the structure as follows:

CREATE TABLE TransactionState
(
    TransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
    TransactionStateName VarChar (100) NOT NULL,
    StartDateTime DateTime NOT NULL,
    EndDateTime NULL
)

My question is regarding the TransactionStateId column. Over time, I may have duplicate TransactionStateId values in my OLAP, but with the combination of StartDateTime and EndDateTime, they would be unique.

I have seen samples of Type-2 Dimensions where an OriginalTransactionStateId is added and the incoming TransactionStateId is mapped to it, plus a new TransactionStateId IDENTITY field becomes the PK and is used for the joins.

CREATE TABLE TransactionState
(
    TransactionStateId INT IDENTITY (1, 1) NOT NULL,
    OriginalTransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
    TransactionStateName VarChar (100) NOT NULL,
    StartDateTime DateTime NOT NULL,
    EndDateTime NULL
)

Should I go with bachellorete #2 or bachellorete #3?


Solution

  • By this phrase:

    With the combination of StartDateTime and EndDateTime, they would be unique.

    you mean that they never overlap or that they satisfy the database UNIQUE constraint?

    If the former, then you can use the StartDateTime in joins, but note that it may be inefficient, since it will use a "<=" condition instead of "=".

    If the latter, then just use a fake identity.

    Databases in general do not allow an efficient algorithm for this query:

    SELECT  *
    FROM    TransactionState
    WHERE   @value BETWEEN StartDateTime AND EndDateTime
    

    , unless you do arcane tricks with SPATIAL data.

    That's why you'll have to use this condition in a JOIN:

    SELECT  *
    FROM    factTable
    CROSS APPLY
            (
            SELECT  TOP 1 *
            FROM    TransactionState
            WHERE   StartDateTime <= factDateTime
            ORDER BY
                    StartDateTime DESC
            )
    

    , which will deprive the optimizer of possibility to use HASH JOIN, which is most efficient for such queries in many cases.

    See this article for more details on this approach:

    Rewriting the query so that it can use HASH JOIN resulted in 600% times performance gain, though it's only possible if your datetimes have accuracy of a day or lower (or a hash table will grow very large).

    Since your time component is stripped of your StartDateTime and EndDateTime, you can create a CTE like this:

    WITH    cal AS
            (
            SELECT CAST('2009-01-01' AS DATE) AS cdate
            UNION ALL
            SELECT DATEADD(day, 1, cdate)
            FROM   cal
            WHERE  cdate <= '2009-03-01'
            ),
            state AS
            (
            SELECT  cdate, ts.*
            FROM    cal
            CROSS APPLY
                    (
                    SELECT  TOP 1 *
                    FROM    TransactionState
                    WHERE   StartDateTime <= cdate
                    ORDER BY
                            StartDateTime DESC
                    ) ts
            WHERE   ts.EndDateTime >= cdate
            )
    SELECT  *
    FROM    factTable
    JOIN    state
    ON      cdate = DATE(factDate)
    

    If your date ranges span more than 100 dates, adjust MAXRECURSION option on CTE.