data-warehousestagingstar-schemaoltp

OLTP-STAGGING-DW-DUPLICATES


I have a project to extract data from an OLTP db to a Stagging star schema and then Datawarehouse db. I have a relationship : playlist<---(1-N)PlaylistTRack----->(1-N)Track--->(N-1)Sales. There is a concern in the group that duplicates will be created if I put ttables Playlist/playlistTrack/Track in 1 dimtable in Stagging bd, but i dont get how that comes about or if there really is a problem.

As we have searched, we found that we have to do a bridge table. But does that solve the problem if there is an actual one?


Solution

  • There is a concern in the group that duplicates will be created if I put ttables Playlist/playlistTrack/Track in 1 dimtable

    This is a valid concern. Your playlist and your track are two different conceptual things that relate to each other in a many-to-many way (a playlist can have multiple tracks and a track can appear in more than one playlist). You most certainly would not want to collapse this into a single table or you will have duplicates.

    The bridge table that solves this is your playlisttrack table. The grain of that table is one track on one playlist, and the key is a combination of the keys to both the playlist and track dimensions.

    playlist <--- playlisttrack ---> track
    

    Now, if you have a different table playing the true role of a fact table (sales), then this doesn't change what's needed. Your sales fact table would have a single key to playlist:

    sales (fact) ---> playlist <--- playlisttrack ---> track
    

    Your product dimension has been split into three tables in a pattern known as "snowflaking" but that's unavoidable. In this situation, while your sales table still acts as the real fact table, the bridge table (playlisttrack) is in a way a fact table of its own with its own granularity. It's important to realize this because you have to be careful when joining from sales through the bridge table to track, as that changes the granularity of your query. Unless your query filters on a single track, you would need to use careful GROUP BY logic with appropriate aggregation (MAX vs SUM on sales facts until grouped at the sales fact grain, then SUM only above that grouping, etc..) in any such query to avoid overcounting your sales facts.