I have a project to extract data from an OLTP db to a Stagging star schema and then Datawarehouse db. I have a relationship : playlist<---(1-N)PlaylistTRack----->(1-N)Track--->(N-1)Sales. There is a concern in the group that duplicates will be created if I put ttables Playlist/playlistTrack/Track in 1 dimtable in Stagging bd, but i dont get how that comes about or if there really is a problem.
As we have searched, we found that we have to do a bridge table. But does that solve the problem if there is an actual one?
There is a concern in the group that duplicates will be created if I put ttables Playlist/playlistTrack/Track in 1 dimtable
This is a valid concern. Your playlist and your track are two different conceptual things that relate to each other in a many-to-many way (a playlist can have multiple tracks and a track can appear in more than one playlist). You most certainly would not want to collapse this into a single table or you will have duplicates.
The bridge table that solves this is your playlisttrack
table. The grain of that table is one track on one playlist, and the key is a combination of the keys to both the playlist and track dimensions.
playlist <--- playlisttrack ---> track
Now, if you have a different table playing the true role of a fact table (sales), then this doesn't change what's needed. Your sales
fact table would have a single key to playlist
:
sales (fact) ---> playlist <--- playlisttrack ---> track
Your product dimension has been split into three tables in a pattern known as "snowflaking" but that's unavoidable. In this situation, while your sales
table still acts as the real fact table, the bridge table (playlisttrack
) is in a way a fact table of its own with its own granularity. It's important to realize this because you have to be careful when joining from sales
through the bridge table to track
, as that changes the granularity of your query. Unless your query filters on a single track, you would need to use careful GROUP BY
logic with appropriate aggregation (MAX
vs SUM
on sales facts until grouped at the sales fact grain, then SUM
only above that grouping, etc..) in any such query to avoid overcounting your sales facts.