I am currently designing a star schema for a reporting database where an online product's performance is measured. The challenge is, that I receive information which is in principle measuring the same facts (visits, purchases) and has the same dimensions (user gender, user age, day) but with varying granularity depending on the source, for example, given a total of 10 visits:
The issues is, if I store them in the same fact table, I will have incorrect values when applying aggregate functions:
Day | Visits | Age | Gender | Source |
---|---|---|---|---|
19/04/2022 | 5 | 18-24 | Male | A |
19/04/2022 | 10 | 18-24 | Female | A |
19/04/2022 | 2 | NULL | Male | B |
19/04/2022 | 8 | NULL | Female | B |
19/04/2022 | 10 | 18-24 | NULL | B |
(The sum of the visits column would count 20 for source B even though we only have 10 visits for this source, they just appear double due to the different data structure)
Is there a best practice for cases where dimensions and facts are generally the same, but the raw data granularity is different?
Is there a best practice for cases where dimensions and facts are generally the same, but the raw data granularity is different?
You typically can only present the combined data at a grain that's compatible with all the sources, so (Day), (Age,Day), or (Gender,Day).
Alternatively you could "allocate" the Source B data, say applying the gender split for the day to each age group. The totals would work, but the drilldown wouldn't be meaningful.