I have a time series data set. For each date, I should have the following:
Category | Sub-category | Value |
---|---|---|
A | A1 | 1 |
A | A2 | 2 |
B | B1 | 3 |
B | B2 | 4 |
However, I do have sometimes missing values, something like:
Category | Sub-category | Value |
---|---|---|
A | A1 | 1 |
B | B2 | 4 |
What I want to do is identify missing values for each date. So with the above, it would be:
Category | Sub-category | Value |
---|---|---|
A | A1 | 1 |
A | A2 | NULL |
B | B1 | NULL |
B | B2 | 4 |
To do that, I have created a table like this in Snowflake called scaffold
:
Category | Sub-category |
---|---|
A | A1 |
A | A2 |
B | B1 |
B | B2 |
And I have written the following query:
select
a.date
,b.category
,b.subcategory
,a.value
from main_table a
right outer join scaffold b
on a.category=b.category
and a.subcategory=b.subcategory
Strangely, my output stays the same as illustrated below - it's as if I wasn't using this scaffold:
Category | Sub-category | Value |
---|---|---|
A | A1 | 1 |
B | B2 | 4 |
I've tried changing the join to a full outer join, to no avail. Any help appreciated!
I would recommend cross joining the expected categories / subcategories (scaffold
) with the list of distinct dates that are available in the main table.
This gives you all possible combinations (ie all rows of the resultset); then, the main table can be brought with a left join
- when the join does not match, the value
end up as null
:
select d.date, s.category, s.subcategory, m.value
from (select distinct date from main_table) d
cross join scaffold s
left join main_table m
on m.date = d.date
and m.category = s.category
and m.subcategory = s.subcategory
Ideally you would have a separate table that stores the dates only, which we would use instead of select distinct
in subquery d
.