I'm looking to pivot a transactional data set into SCD2 form, in a way that captures the intervals during which each combination of values at the pivot grain was effective.
Snowflake is the DBMS I'm actually using, but I'm tagging Oracle too because their dialects are nearly identical, and I could probably adapt a solution written for any DBMS.
I have working SQL, but it was born out of trial and error, and I feel like there has to be a more elegant way that I'm missing, because it's both ugly and computationally expensive.
(Note: the second record in the input data "expires" the first. It can be assumed that every day of interest occurs as an add_dts at least once.)
Input:
Original_Grain | Pivot_Grain | Pivot_Column | Pivot_Attribute | Add_Dts |
---|---|---|---|---|
OG-1 | PG-1 | First_Col | A | 2020-01-01 |
OG-1 | PG-1 | First_Col | B | 2020-01-02 |
OG-2 | PG-1 | Second_Col | A | 2020-01-01 |
OG-3 | PG-1 | Third_Col | C | 2020-01-02 |
OG-3 | PG-1 | Third_Col | B | 2020-01-03 |
Output:
Pivot_Grain | First_Col | Second_Col | Third_Col | From_Dt | To_Dt |
---|---|---|---|---|---|
PG-1 | A | A | NULL | 2020-01-01 | 2020-01-02 |
PG-1 | B | A | C | 2020-01-02 | 2020-01-03 |
PG-1 | B | A | B | 2020-01-03 | 9999-01-01 |
WITH INPUT AS
( SELECT 'OG-1' AS Original_Grain,
'PG-1' AS Pivot_Grain,
'First_Col' AS Pivot_Column,
'A' AS Pivot_Attribute,
TO_DATE('2020-01-01','YYYY-MM-DD') AS Add_Dts
FROM dual
UNION
SELECT 'OG-1' AS Original_Grain,
'PG-1' AS Pivot_Grain,
'First_Col' AS Pivot_Column,
'B' AS Pivot_Attribute,
TO_DATE('2020-01-02','YYYY-MM-DD')
FROM dual
UNION
SELECT 'OG-2' AS Original_Grain,
'PG-1' AS Pivot_Grain,
'Second_Col' AS Pivot_Column,
'A' AS Pivot_Attribute,
TO_DATE('2020-01-01','YYYY-MM-DD')
FROM dual
UNION
SELECT 'OG-3' AS Original_Grain,
'PG-1' AS Pivot_Grain,
'Third_Col' AS Pivot_Column,
'C' AS Pivot_Attribute,
TO_DATE('2020-01-02','YYYY-MM-DD')
FROM dual
UNION
SELECT 'OG-3' AS Original_Grain,
'PG-1' AS Pivot_Grain,
'Third_Col' AS Pivot_Column,
'B' AS Pivot_Attribute,
TO_DATE('2020-01-03','YYYY-MM-DD')
FROM dual
),
GET_NORMALIZED_RANGES AS
( SELECT I.*,
COALESCE(
LEAD(Add_Dts) OVER (
PARTITION BY I.Original_Grain
ORDER BY I.Add_Dts), TO_DATE('9000-01-01')
) AS Next_Add_Dts
FROM INPUT I
),
GET_DISTINCT_ADD_DATES AS
( SELECT DISTINCT Add_Dts AS Driving_Date
FROM Input
),
NORMALIZED_EFFECTIVE_AT_EACH_POINT AS
( SELECT GNR.*,
GDAD.Driving_Date
FROM GET_NORMALIZED_RANGES GNR
INNER
JOIN GET_DISTINCT_ADD_DATES GDAD
ON GDAD.driving_date >= GNR.add_dts
AND GDAD.driving_Date < GNR.next_add_dts
),
PIVOT_EACH_POINT AS
( SELECT DISTINCT
Pivot_Grain,
Driving_Date,
MAX("'First_Col'") OVER ( PARTITION BY Pivot_Grain, Driving_Date) AS First_Col,
MAX("'Second_Col'") OVER ( PARTITION BY Pivot_Grain, Driving_Date) AS Second_Col,
MAX("'Third_Col'") OVER ( PARTITION BY Pivot_Grain, Driving_Date) AS Third_Col
FROM NORMALIZED_EFFECTIVE_AT_EACH_POINT NEP
PIVOT (MAX(Pivot_Attribute) FOR PIVOT_COLUMN IN ('First_Col','Second_Col','Third_Col'))
)
SELECT Pivot_Grain,
Driving_Date AS From_Dt,
COALESCE(LEAD(Driving_Date) OVER ( PARTITION BY pivot_grain ORDER BY Driving_Date),TO_DATE('9999-01-01')) AS To_Dt,
First_Col,
Second_Col,
Third_Col
FROM PIVOT_EACH_POINT
The input can be written with the VALUES operator, with the column names moved into the CTE definition, so it takes up less space:
WITH input(original_grain, pivot_grain, pivot_column, pivot_attribute, add_dts) AS (
SELECT * FROM VALUES
('OG-1', 'PG-1', 'First_Col', 'A', '2020-01-01'::date),
('OG-1', 'PG-1', 'First_Col', 'B', '2020-01-02'::date),
('OG-2', 'PG-1', 'Second_Col', 'A', '2020-01-01'::date),
('OG-3', 'PG-1', 'Third_Col', 'C', '2020-01-02'::date),
('OG-3', 'PG-1', 'Third_Col', 'B', '2020-01-03'::date)
)
The LEAD can be simplified by using its optional default argument, which acts as an implicit COALESCE. Also, if you have gaps in this kind of data, IGNORE NULLS on LAG/LEAD is an awesome tool (see the sketch after this CTE).
, get_normalized_ranges AS (
SELECT
*
,LEAD(add_dts,1,'9000-01-01'::date) OVER (PARTITION BY original_grain ORDER BY add_dts) AS next_add_dts
FROM input
)
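As a hedged aside, not part of the query above: if pivot_attribute ever arrived sparsely with NULLs, something like the following could carry the last known value forward. Here sparse_input is a hypothetical source, not one of the CTEs in this answer.
SELECT original_grain
      ,add_dts
      -- IGNORE NULLS skips NULL attributes and returns the most recent non-NULL value
      ,LAG(pivot_attribute) IGNORE NULLS OVER (PARTITION BY original_grain ORDER BY add_dts) AS last_known_attribute
FROM sparse_input;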
get_distinct_add_dates seems fine.
, get_distinct_add_dates AS (
SELECT DISTINCT add_dts AS driving_date
FROM input
)
Depending on your data, normalized_effective_at_each_point will live up to its name and give you a row at every date point, which will over-slice values that are unrelated. (I assume pivot_grain is some global identifier that separates unrelated data, so this extended input reflects that:)
('OG-1', 'PG-1', 'First_Col', 'A', '2020-01-01'::date),
('OG-1', 'PG-1', 'First_Col', 'B', '2020-01-03'::date),
('OG-2', 'PG-1', 'Second_Col','A', '2020-01-01'::date),
('OG-3', 'PG-1', 'Third_Col', 'C', '2020-01-03'::date),
('OG-3', 'PG-1', 'Third_Col', 'B', '2020-01-05'::date),
('OG-4', 'PG-2', 'First_Col', 'D', '2020-02-02'::date),
('OG-4', 'PG-2', 'First_Col', 'E', '2020-02-04'::date),
('OG-5', 'PG-2', 'Second_Col','D', '2020-02-02'::date),
('OG-6', 'PG-2', 'Third_Col', 'F', '2020-02-04'::date),
('OG-6', 'PG-2', 'Third_Col', 'D', '2020-02-06'::date)
at which point get_distinct_add_dates
should become:
, get_distinct_add_dates AS (
SELECT DISTINCT pivot_grain, add_dts AS driving_date
FROM input
)
INNER JOIN is just JOIN, so we can drop the unneeded INNER; note the join now also matches on pivot_grain:
, normalized_effective_at_each_point AS (
SELECT gnr.*,
gdad.driving_date
FROM get_normalized_ranges AS gnr
JOIN get_distinct_add_dates AS gdad
ON gnr.pivot_grain = gdad.pivot_grain
AND gdad.driving_date >= gnr.add_dts
AND gdad.driving_date < gnr.next_add_dts
)
pivot_each_point is really a three-way join, or it can be written as a GROUP BY, which is what the DISTINCT was effectively doing for us, and then the PIVOT disappears:
, pivot_each_point AS (
SELECT Pivot_Grain
,Driving_Date
,MAX(IFF(pivot_column='First_Col', Pivot_Attribute, NULL)) as first_col
,MAX(IFF(pivot_column='Second_Col', Pivot_Attribute, NULL)) as second_col
,MAX(IFF(pivot_column='Third_Col', Pivot_Attribute, NULL)) as third_col
FROM normalized_effective_at_each_point
GROUP BY 1,2
)
Lastly, the final LEAD can drop the COALESCE and be moved into pivot_each_point. Putting it all together:
WITH input(original_grain, pivot_grain, pivot_column, pivot_attribute, add_dts) AS (
SELECT * FROM VALUES
('OG-1', 'PG-1', 'First_Col', 'A', '2020-01-01'::date),
('OG-1', 'PG-1', 'First_Col', 'B', '2020-01-03'::date),
('OG-2', 'PG-1', 'Second_Col','A', '2020-01-01'::date),
('OG-3', 'PG-1', 'Third_Col', 'C', '2020-01-03'::date),
('OG-3', 'PG-1', 'Third_Col', 'B', '2020-01-05'::date),
('OG-4', 'PG-2', 'First_Col', 'D', '2020-02-02'::date),
('OG-4', 'PG-2', 'First_Col', 'E', '2020-02-04'::date),
('OG-5', 'PG-2', 'Second_Col','D', '2020-02-02'::date),
('OG-6', 'PG-2', 'Third_Col', 'F', '2020-02-04'::date),
('OG-6', 'PG-2', 'Third_Col', 'D', '2020-02-06'::date)
), get_normalized_ranges AS (
SELECT
*
,LEAD(add_dts,1,'9000-01-01'::date) OVER (PARTITION BY original_grain ORDER BY add_dts) AS next_add_dts
FROM input
), get_distinct_add_dates AS (
SELECT DISTINCT pivot_grain, add_dts AS driving_date
FROM input
), normalized_effective_at_each_point AS (
SELECT gnr.*,
gdad.driving_date
FROM get_normalized_ranges AS gnr
JOIN get_distinct_add_dates AS gdad
ON gnr.pivot_grain = gdad.pivot_grain
AND gdad.driving_date >= gnr.add_dts
AND gdad.driving_date < gnr.next_add_dts
)
SELECT pivot_grain
,driving_date
,LEAD(driving_date, 1, '9999-01-01'::date) OVER (PARTITION BY pivot_grain ORDER BY driving_date) AS to_dt
,MAX(IFF(pivot_column = 'First_Col', pivot_attribute, NULL)) AS first_col
,MAX(IFF(pivot_column = 'Second_Col', pivot_attribute, NULL)) AS second_col
,MAX(IFF(pivot_column = 'Third_Col', pivot_attribute, NULL)) AS third_col
FROM normalized_effective_at_each_point
GROUP BY pivot_grain, driving_date
ORDER BY pivot_grain, driving_date;
giving the result:
PIVOT_GRAIN | DRIVING_DATE | TO_DT | FIRST_COL | SECOND_COL | THIRD_COL |
---|---|---|---|---|---|
PG-1 | 2020-01-01 | 2020-01-03 | A | A | null |
PG-1 | 2020-01-03 | 2020-01-05 | B | A | C |
PG-1 | 2020-01-05 | 9999-01-01 | B | A | B |
PG-2 | 2020-02-02 | 2020-02-04 | D | D | null |
PG-2 | 2020-02-04 | 2020-02-06 | E | D | F |
PG-2 | 2020-02-06 | 9999-01-01 | E | D | D |
I cannot help but think I have over-mapped how I process data onto your PIVOT_GRAIN. I started trying to solve this again from first principles now that I understand the code, and I think the first three processing CTEs are how I would do it, and the GROUP BY is how I would do the rest. Many joins seem really gross, and in Snowflake I am more of a fan of this pattern: explode the data, then MERGE (or GROUP BY) it, as that is all nicely parallelizable.
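For illustration only, here is a minimal sketch of what that explode-then-MERGE pattern could look like, assuming the final SELECT above has been materialized as pivoted_intervals and that there is a target table scd2_target(pivot_grain, first_col, second_col, third_col, from_dt, to_dt); both names are hypothetical and not from the original question.
MERGE INTO scd2_target tgt
USING (
    SELECT pivot_grain, first_col, second_col, third_col,
           driving_date AS from_dt, to_dt
    FROM pivoted_intervals
) src
  ON  tgt.pivot_grain = src.pivot_grain
  AND tgt.from_dt     = src.from_dt
-- refresh an existing interval whose attributes or end date changed
WHEN MATCHED THEN UPDATE SET
    first_col  = src.first_col,
    second_col = src.second_col,
    third_col  = src.third_col,
    to_dt      = src.to_dt
-- insert intervals that do not exist yet
WHEN NOT MATCHED THEN INSERT
    (pivot_grain, first_col, second_col, third_col, from_dt, to_dt)
VALUES
    (src.pivot_grain, src.first_col, src.second_col, src.third_col, src.from_dt, src.to_dt);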