data-modelingdata-warehousedimensional-modelingstar-schemadatamart

Star Schema Design for Social Media


I am new to dimensional modeling and have read a lot of material (star-schema, dimension/fact tables, SCD, Ralph Kimball's - The Data Warehouse Toolkit book, etc). So I have a good conceptual understanding of dimensional modeling constructs but finding it hard to apply to a usecase due to lack of experience and need some guidance.

Consider Twitter for example, I want to design a dimensional model to calculate -

  1. DAU (Daily active users) = number of users who logged in and accessed twitter via website or mobile app on a given day
  2. MAU (Monthly active users) = number of users who logged in and accessed twitter via website or mobile app in last 30 days including measurement date
  3. User engagement = total(clicks + favorites + replies + retweets) on a tweet

These metrics over a period (like month) is the summation of these metrics on each day in that period.

I want to write SQLs to calculate these metrics for every quarter by region (eg: US and rest of world) and calculate year-over-year growth (or decline) in these metrics.
Eg: enter image description here

Here are some details that I thought about -

Factless (transaction) fact table for user login activity with grain of 1 row per user per login : user_login_fact_schema (user_dim_key, date_dim_key, user_location_dim_key, access_method_dim_key)

Factless (transaction) fact table for user activity with grain of 1 row per user per activity : user_activity_fact_schema (user_dim_key, date_dim_key, user_location_dim_key, access_method_dim_key, post_key, activity_type_key)

Does this sounds correct? How should my model look like? What other dimensions/facts can I add here?

Wonder if I should collapse these 2 tables into 1 and have activity_type for logins as 'login', but there can be a huge number of logins without any activity so this will skew the data. Am I missing anything else?


Solution

  • Your model seems correct, it answers the questions on the graph you posted.

    It could make sense to aggregate those two fact tables into one fact table joined with a "UserAction" dimension, mostly because a login can be interpreted as just another user action.

    However, having separate fact tables focused on one metric (or process) may be preferable because it enables you to introduce measures/metrics into the tables, i.e. when your fact tables stop being factless. It also spares you a join with another dimension (UserAction) but that is becoming a bit less relevant these days, where storage and DB processing power are just getting cheaper.