etldata-warehouseolapmondrianrolap

What could be incorrect in this multidimensional model?


I want to build a simple multidimensional data model by using the star schema in a relational database (ROLAP). For that I create a fact table and two tables of dimensions. Firstly I copy the data from the operational source and handle this data (some simplified ETL process).

In my model only two dimensions: date and status. Measure: the number of certain statuses (for a time).

The time dimension table:

CREATE TABLE [dbo].[tbl_date_dim] (    
    [ID][int]       IDENTITY(1,1) NOT NULL,
    [date_key][int] NOT NULL primary key,
    [Year][int]     NOT NULL,
    [Month][int]    NOT NULL,
    [Day][int]      NOT NULL        
);

There is a table - tbl_application - in which is stored the whole time range (field VersionDate). Therefore, the time dimension table I'm filling this way:

INSERT INTO [dbo].[tbl_date_dim] 
    ([date_key], 
    [Year], 
    [Month], 
    [Day]) 
(
  SELECT DISTINCT
    CAST(YEAR(VersionDate) as VARCHAR(4)) + 
    RIGHT('00' + CAST(MONTH(VersionDate) as VARCHAR(2)) ,2) +
    RIGHT('00' + CAST(DAY(VersionDate) as VARCHAR(2)), 2) as 'date_key',
    YEAR(inner_data.VersionDate)    as 'Year',
    MONTH(inner_data.VersionDate)   as 'Month', 
    DAY(inner_data.VersionDate)     as 'Day'
  FROM (
        SELECT 
            VersionDate 
        FROM [dbo].[tbl_application]
  ) AS inner_data
);

The status dimension table: I use whole existing table tbl_applicationstatus.

Next, I create a fact table. It contains foreign keys to dimension tables and measures.

CREATE TABLE [dbo].[tbl_olap_fact] (
    [ID][int] IDENTITY(1,1) NOT NULL,    

    [status_id][int] NOT NULL,           // FK  
    [date_dim][int] NOT NULL,            // FK

    [staus_name] varchar(100) NOT NULL, // Non additive measure
    [transaction_id][int] NOT NULL,     // Additive measure

    CONSTRAINT [PK_tbl_olap_fact] PRIMARY KEY CLUSTERED
(
    [ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY];

transaction_id - this field, which I will aggregate (number of statuses).

Next, I add the relationship between the fact table and dimension tables:

ALTER TABLE [dbo].[tbl_olap_fact] ADD CONSTRAINT [FK_tbl_olap_fact_tbl_date_dim] FOREIGN KEY([date_dim])
REFERENCES [dbo].[tbl_date_dim] ([date_key]);

ALTER TABLE [dbo].[tbl_olap_fact] ADD CONSTRAINT [FK_tbl_olap_fact_tbl_applicationstatus] FOREIGN KEY([status_id])
REFERENCES [dbo].[tbl_applicationstatus] ([ID]);

Then I fill the fact table:

INSERT INTO [dbo].[tbl_olap_fact] 
    ([transaction_id], 
    [status_id], 
    [staus_name], 
    [date_dim]) 
(
  SELECT DISTINCT
    core.id          as 'transaction_id',
    core_status.ID   as 'status_id',
    core_status.name as 'status_name',
    CAST(YEAR(core.VersionDate) as VARCHAR(4)) + 
    RIGHT('00' + CAST(MONTH(core.VersionDate) as VARCHAR(2)) ,2) +
    RIGHT('00' + CAST(DAY(core.VersionDate)   as VARCHAR(2)), 2) as 'date_dim' 
  FROM 
    [dbo].[tbl_application] as core
        inner join tbl_applicationstatus as core_status
         on core.ApplicationStatusID = core_status.ID
  WHERE IsRaw = 0
);

As the OLAP server I'm using Mondrian. Mondrian schema that defines the logical model of the multidimensional database:

<Schema name="olap_schema">
  <Dimension type="TimeDimension" visible="true" highCardinality="false" name="Date first dim">
    <Hierarchy name="date_hierarchy" visible="true" hasAll="true" primaryKey="date_key" description="">

      <Table name="tbl_date_dim" schema="dbo">
      </Table>

      <Level name="" 
            visible="true" 
            table="tbl_date_dim" 
            column="Year" 
            nameColumn="Year" 
            type="Numeric" 
            uniqueMembers="true" 
            levelType="TimeYears" 
            hideMemberIf="Never" 
            description="">         
      </Level>

      <Level name="" 
             visible="true" 
             table="tbl_date_dim" 
             column="Month" 
             nameColumn="Month" 
             ordinalColumn="Month" 
             type="Numeric" 
             uniqueMembers="false" 
             levelType="TimeMonths" 
             hideMemberIf="Never" 
             description="">
      </Level>

      <Level name="" 
             visible="true" 
             table="tbl_date_dim" 
             column="Day" 
             nameColumn="Day" 
             ordinalColumn="Day" 
             type="Numeric" 
             uniqueMembers="false" 
             levelType="TimeDays" 
             hideMemberIf="Never" 
             description="">
      </Level>

    </Hierarchy>
  </Dimension>

  <Dimension type="TimeDimension" visible="true" highCardinality="false" name="Date second dim">
    <Hierarchy name="date_hierarchy" visible="true" hasAll="true" primaryKey="date_key" description="">
      <Table name="tbl_date_dim" schema="dbo">
      </Table>

      <Level name="" 
             visible="true" 
             table="tbl_date_dim" 
             column="Year" 
             nameColumn="Year" 
             type="Numeric" 
             uniqueMembers="true" 
             levelType="TimeYears" 
             hideMemberIf="Never" 
             description="">
      </Level>

      <Level name="" 
             visible="true" 
             table="tbl_date_dim" 
             column="Month" 
             nameColumn="Month" 
             ordinalColumn="Month" 
             type="Numeric" 
             uniqueMembers="false" 
             levelType="TimeMonths" 
             hideMemberIf="Never" 
             description="">
      </Level>

      <Level name="" 
             visible="true" 
             table="tbl_date_dim" 
             column="Day" 
             nameColumn="Day" 
             ordinalColumn="Day" 
             type="Numeric" 
             uniqueMembers="false" 
             levelType="TimeDays" 
             hideMemberIf="Never" 
             description="">
      </Level>

    </Hierarchy>
  </Dimension>

  <Dimension type="StandardDimension" visible="true" highCardinality="false" name="Status dimension">
    <Hierarchy name="status_hierarchy" visible="true" hasAll="true" primaryKey="ID" description="">
      <Table name="tbl_applicationstatus" schema="dbo">
      </Table>
      <Level name="" 
             visible="true" 
             table="tbl_applicationstatus" 
             column="Name" 
             nameColumn="Name" 
             type="String" 
             uniqueMembers="true" 
             levelType="Regular" 
             hideMemberIf="Never" 
             description="">
      </Level>
    </Hierarchy>
  </Dimension>

  <Cube name="enrollment_cube" caption="" visible="true" description="" cache="true" enabled="true">
    <Table name="tbl_olap_fact" schema="dbo">
    </Table>

    <DimensionUsage source="Date first dim" name="X axis" caption="" visible="true" foreignKey="date_dim" highCardinality="false">
    </DimensionUsage>

    <DimensionUsage source="Date second dim" name="Y axis" caption="" visible="true" foreignKey="date_dim" highCardinality="false">
    </DimensionUsage>

    <DimensionUsage source="Status dimension" name="Z axis" caption="" visible="true" foreignKey="status_id" highCardinality="false">
    </DimensionUsage>

    <Measure name="TotalCount" column="transaction_id" aggregator="count" caption="Total" visible="true">
    </Measure>

  </Cube>

</Schema>

As the OLAP client I'm using Saiku Analytics.

enter image description here

Basically, I get the correct data - but not quite sure in it. For example, does the correct that way that I use to populate the fact table? Am I building ETL process properly? This is a test mode and I make some experiments in building data warehouses and multidimensional models.


Solution

  • A disclaimer here: I've never used Mondrian, and the advice I'll give here is generic, sticking close to the Kimball way of doing things. If Mondrian requires some particular changes to the schema, then go for it.


    tbl_date_dim

    This is a good start - a date dimension is critical. You might find you want a separate time dimension as well (in case you want to look at things by hour, for instance).

    I would make to this would be to remove the ID column. What purpose does it serve? Each YYYYMMDD value will be unique, and the standard pattern here is to simply use that as the surrogate key for that table.

    You'll probably find you want to add some more columns to this table as well; for now, I'd suggest adding a date column with the actual date, as this is the business key for this dimension. You should always have both a surrogate key and a business key in each dimension.

    Read up on the concept of a business key if you're not already familiar with this term - it's basically a meaningful name that you organisation would use when referring to a particular dimension member, often a name of some kind. People frequently make the mistake of using a meaningless code or unhelpful abbreviation from the source system, but these should really be avoided.

    I would suggest populating tbl_date_dim differently. Sure, the way you're doing it will work for now, but your date dimension will end up being referred to by many other tables, and you might find it's missing dates you need. The standard solution is just to script this up, or even throw together a spreadsheet and import it, and to include a suitable ranges of dates into the past and future. It's never that big, so size isn't really an issue. If you want to script it, you should be able to find scripts that do the work for you with a bit of searching.

    tbl_applicationstatus

    It's hard to comment on this as you don't show the DDL. You quite possibly shouldn't be using the entire source table, though. Make sure you have a surrogate key created in your data warehouse (an identity column is fine, name it something like application_status_key), and the business key. This is probably status_name.

    tbl_olap_fact

    A fact table should be one or more measures at the same grain, and foreign keys to dimension tables. Understanding the grain is critical, and you should think about what the grain is and then give the fact a meaningful name which reflects it. If your fact is going to have one of more measures related to transactions, then tbl_transaction_fact might be a good name, for instance.

    It's a little unclear what you're trying to measure here as you don't explain what data is in the tbl_application source table, but it looks like you're trying to perhaps count the number of transactions carried out while a certain application status was set? Note that you don't actually have any additive measues here; an additive measure is something that can be summed - a monetary amount, a number of items, etc.

    If you're just doing this as an example, I strongly recommend that you first think about a question you want your cube to answer which involves something that is additive (i.e. something along the lines of "How much are all applications created in January worth?" if your applications have a monetary value), and then model something that will allow that question to be answered.

    You can absolutely do a count as well, but importing the transaction_id value for the source probably isn't necessary - you could just count your ID column.

    You should remove status_name from the fact table as you should get to that by linking to your status dimension, and using the name column from there. The status name is the business key from that dimension, rather than a non-additive measure.

    When you're populating the fact, the usual pattern is to work with the business keys, and then pick up the surrogate keys from either the dimensions themselves or from look-up tables, then load the fact with just the surrogate keys pointing to the dimensions.


    There's a really handy Kimball guide which gives a brief overview of various techniques. It's really handy as both an initial place to look up concepts, and also to refer back to when you need a reminder - I'd suggest giving it a read through and saving it.