I have a table data as show below.
cust_id | city_type | city_name | start_date |
---|---|---|---|
1 | physical | Las Vegas | 5/17/2024 |
1 | office | Seattle | 5/17/2024 |
1 | office | Dallas | 9/20/2024 |
1 | physical | Dallas | 10/30/2024 |
1 | office | Austin | 2/10/2025 |
and I am expecting the below result. My data has around 1M customer_ids. Trying to implement SCD type 2 style.
cust_id | physical_city | office_city | start_date | end_date |
---|---|---|---|---|
1 | Las Vegas | Seattle | 5/17/2024 | 9/20/2024 |
1 | Las Vegas | Dallas | 9/20/2024 | 10/30/2024 |
1 | Dallas | Dallas | 10/30/2024 | 2/10/2025 |
1 | Dallas | Austin | 2/10/2025 | 12/31/9999 |
I have tried below query but I am missing something that I am not getting the required answer.
select
cust_id,
max(case when city_type='Physical' then city_name end) as physical_city,
max(case when city_type='Office' then city_name end) as office_city,
start_date,
coalesce(lead(start_date) over(partition by party_id order by start_date), '9999-12-31') as end_date
from table
group by cust_id, start_date
Thank you!
It is necessary to consider cases when the data one type ('office' or 'physical') change 2 or more times in a row sequence ordered by start_date. For example:
cust_id | city_type | city_name | start_date |
---|---|---|---|
2 | office | Seattle | 2024-05-17 |
2 | physical | Las Vegas | 2024-05-17 |
2 | office | Dallas | 2024-09-20 |
2 | office | Boston | 2024-09-21 |
2 | office | Seatle | 2024-09-22 |
2 | physical | Dallas | 2024-10-30 |
2 | office | Austin | 2025-02-10 |
After grouping (pivot) we have
cust_id | start_date | phisycal_city | office_city |
---|---|---|---|
2 | 2024-05-17 | Las Vegas | Seattle |
2 | 2024-09-20 | null | Dallas |
2 | 2024-09-21 | null | Boston |
2 | 2024-09-22 | null | Seatle |
2 | 2024-10-30 | Dallas | null |
2 | 2025-02-10 | null | Austin |
So, calculation's like lead(start_date...) not enough.
city_type
start_date - to_date.
We use there partition by cust_id,city_type
select *
,lead(start_date,1,cast('9999-12-31' as date))
over(partition by cust_id,city_type order by start_date) to_date
From test
Test data (expanded)
cust_id=1 - your example
cust_id=2 - 'office' city change twice
cust_id=3 - 'physical' city change twice
cust_id=4,5 - first rows for 'physical' and 'office' do nit have same date
See example
cust_id | city_type | city_name | start_date |
---|---|---|---|
1 | physical | Las Vegas | 2024-05-17 |
1 | office | Seattle | 2024-05-17 |
1 | office | Dallas | 2024-09-20 |
1 | physical | Dallas | 2024-10-30 |
1 | office | Austin | 2025-02-10 |
2 | office | Seattle | 2024-05-17 |
2 | physical | Las Vegas | 2024-05-17 |
2 | office | Dallas | 2024-09-20 |
2 | office | Boston | 2024-09-21 |
2 | physical | Dallas | 2024-10-30 |
2 | office | Austin | 2025-02-10 |
3 | office | Seattle | 2024-05-17 |
3 | physical | Las Vegas | 2024-05-17 |
3 | physical | Dallas | 2024-09-20 |
3 | office | Boston | 2024-09-21 |
3 | physical | Seatle | 2024-10-30 |
3 | physical | Austin | 2025-02-10 |
4 | office | Dallas | 2024-10-30 |
4 | physical | Austin | 2025-02-10 |
5 | physical | Dallas | 2024-10-30 |
5 | office | Austin | 2025-02-10 |
with ranged_data as(
select *
,lead(start_date,1,cast('9999-12-31' as date))
over(partition by cust_id,city_type order by start_date) to_date
From test
)
,agg_data as(
select cust_id,start_date
,max(case when city_type='physical' then city_name end) physical_city
,max(case when city_type='office' then city_name end) office_city
from test
group by cust_id,start_date
)
select ad.cust_id,ad.start_date
,lead(ad.start_date,1,cast('9999-12-31' as date))
over(partition by ad.cust_id order by ad.start_date) end_date
,coalesce(ad.physical_city,rd2.city_name) physical_city
,coalesce(ad.office_city,rd1.city_name) office_city
from agg_data ad
left join ranged_data rd1 on rd1.cust_id=ad.cust_id and rd1.city_type='office'
and rd1.start_date<ad.start_date and rd1.to_date>=ad.start_date
left join ranged_data rd2 on rd2.cust_id=ad.cust_id and rd2.city_type='physical'
and rd2.start_date<ad.start_date and rd2.to_date>=ad.start_date
order by ad.cust_id,ad.start_date
cust_id | start_date | end_date | physical_city | office_city |
---|---|---|---|---|
1 | 2024-05-17 | 2024-09-20 | Las Vegas | Seattle |
1 | 2024-09-20 | 2024-10-30 | Las Vegas | Dallas |
1 | 2024-10-30 | 2025-02-10 | Dallas | Dallas |
1 | 2025-02-10 | 9999-12-31 | Dallas | Austin |
2 | 2024-05-17 | 2024-09-20 | Las Vegas | Seattle |
2 | 2024-09-20 | 2024-09-21 | Las Vegas | Dallas |
2 | 2024-09-21 | 2024-10-30 | Las Vegas | Boston |
2 | 2024-10-30 | 2025-02-10 | Dallas | Boston |
2 | 2025-02-10 | 9999-12-31 | Dallas | Austin |
3 | 2024-05-17 | 2024-09-20 | Las Vegas | Seattle |
3 | 2024-09-20 | 2024-09-21 | Dallas | Seattle |
3 | 2024-09-21 | 2024-10-30 | Dallas | Boston |
3 | 2024-10-30 | 2025-02-10 | Seatle | Boston |
3 | 2025-02-10 | 9999-12-31 | Austin | Boston |
4 | 2024-10-30 | 2025-02-10 | null | Dallas |
4 | 2025-02-10 | 9999-12-31 | Austin | Dallas |
5 | 2024-10-30 | 2025-02-10 | Dallas | null |
5 | 2025-02-10 | 9999-12-31 | Dallas | Austin |