database-designarchitecturedata-warehousedatabricksdata-lake

Data Lake with Kimball's Star Schema and Data Mart


Objective

I little bit confused by terminology: I've built Data Lake (not DW) based on Kimball's data modeling approaches and now not sure if I can use Data Mart definition to name my MPP database layer.

I came from the assumption that you still need Dimensional Modeling and Star Schema for mid+ size organization reports, same reasoning as in this article.

Questions

  1. Is it right to call Synapse a Data Mart at the following architecture (see picture below)?
  2. Can I say that I don't have DW (even if I have Star Schema), but instead I have Data Lake + Data Mart(s)?
  3. Shall I split Synapse into multiple schemas based on business/reports sub-domains (multiple Data Marts)?

Architecture details

enter image description here

To be more specific, in my case:

2-3) ADLS + Databricks form Data Lake. All ETL and Star Schema build happens at Data Lake layer. All logic seats here. Still it has structured and unstructured data at raw layer, use cheap ADLS storage, lack Governance, has ML and will have streaming in the future. In other hand, we have schema-on-write in all DL zones except raw, we have tables modeled upfront (with a lot of requirements changes during the process). Am I right to call it Data Lake?

4.) Synapse serves as a tiny projection/model of ETL/Lake results in order to speed up reports response time. Almost zero logic here, few aggregations. Only final model is loaded to Synapse. Data are not splitted by business sub-domains, we just load everythin in a single DATAMART schema. Is that a good approach?


Solution

  • Firstly, I wouldn't get too bogged down in definitions as there are loads of (slightly) different definitions of these terms. However, given that, I would give a high-level definition of these terms as follows:

    1. Data Lake: this is your source data loaded into data store where you can start to analyse it. It is normally structured in the same way as it is in the source systems (i.e. it is the "raw" data) plus, optionally, some auditing columns to show where the data came from, when it was loaded, etc. Some data lakes have multiple layers e.g. the raw data layer and then a governed data layer where the data has been cleansed, standardised, etc. - but is still in basically the same structure as in the raw data layer

    2. Data Warehouse: this is your Kimball model of all your fact and dimension tables (plus other tables such as bridges). It will be built from the data that exists in your data lake

    3. Data Mart: this is a subject area sourced from your data warehouse. This might be a logical definition (e.g. the Sales mart is the Sales fact table and related dimensions) or it might be physicalised e.g. a single wide table generated from a fact and its dimensions. How you define your datamarts is normally dependent on who/what is consuming them and what their requirements are. For example, you could have multiple Sales Data marts, all based off the same Sales Star, because you have multiple tools that prefer to consume data structured in particular ways

    Hope this helps?