sqlsnowflake-cloud-data-platformwindow-functions

SQL Window Functions - Pivot on a Column


I have a table data as show below.

cust_id city_type city_name start_date
1 physical Las Vegas 5/17/2024
1 office Seattle 5/17/2024
1 office Dallas 9/20/2024
1 physical Dallas 10/30/2024
1 office Austin 2/10/2025

and I am expecting the below result. My data has around 1M customer_ids. Trying to implement SCD type 2 style.

cust_id physical_city office_city start_date end_date
1 Las Vegas Seattle 5/17/2024 9/20/2024
1 Las Vegas Dallas 9/20/2024 10/30/2024
1 Dallas Dallas 10/30/2024 2/10/2025
1 Dallas Austin 2/10/2025 12/31/9999

I have tried below query but I am missing something that I am not getting the required answer.

select 
  cust_id,
  max(case when city_type='Physical' then city_name end) as physical_city,
  max(case when city_type='Office' then city_name end) as office_city,
  start_date,
  coalesce(lead(start_date) over(partition by party_id order by start_date), '9999-12-31') as end_date
from table
group by cust_id, start_date

Thank you!


Solution

  • It is necessary to consider cases when the data one type ('office' or 'physical') change 2 or more times in a row sequence ordered by start_date. For example:

    cust_id city_type city_name start_date
    2 office Seattle 2024-05-17
    2 physical Las Vegas 2024-05-17
    2 office Dallas 2024-09-20
    2 office Boston 2024-09-21
    2 office Seatle 2024-09-22
    2 physical Dallas 2024-10-30
    2 office Austin 2025-02-10

    After grouping (pivot) we have

    cust_id start_date phisycal_city office_city
    2 2024-05-17 Las Vegas Seattle
    2 2024-09-20 null Dallas
    2 2024-09-21 null Boston
    2 2024-09-22 null Seatle
    2 2024-10-30 Dallas null
    2 2025-02-10 null Austin

    So, calculation's like lead(start_date...) not enough.

    1. Take ranges for every city_type start_date - to_date. We use there partition by cust_id,city_type
    select *
      ,lead(start_date,1,cast('9999-12-31' as date))
        over(partition by cust_id,city_type order by start_date) to_date
    From test
    
    1. Group data by start_date (pivot). We take city_name or null for this date.
    2. JOIN ranged_data to cover null's

    Test data (expanded)
    cust_id=1 - your example
    cust_id=2 - 'office' city change twice
    cust_id=3 - 'physical' city change twice
    cust_id=4,5 - first rows for 'physical' and 'office' do nit have same date

    See example

    cust_id city_type city_name start_date
    1 physical Las Vegas 2024-05-17
    1 office Seattle 2024-05-17
    1 office Dallas 2024-09-20
    1 physical Dallas 2024-10-30
    1 office Austin 2025-02-10
    2 office Seattle 2024-05-17
    2 physical Las Vegas 2024-05-17
    2 office Dallas 2024-09-20
    2 office Boston 2024-09-21
    2 physical Dallas 2024-10-30
    2 office Austin 2025-02-10
    3 office Seattle 2024-05-17
    3 physical Las Vegas 2024-05-17
    3 physical Dallas 2024-09-20
    3 office Boston 2024-09-21
    3 physical Seatle 2024-10-30
    3 physical Austin 2025-02-10
    4 office Dallas 2024-10-30
    4 physical Austin 2025-02-10
    5 physical Dallas 2024-10-30
    5 office Austin 2025-02-10
    with ranged_data as(
    select *
      ,lead(start_date,1,cast('9999-12-31' as date))
        over(partition by cust_id,city_type order by start_date) to_date
    From test
    )
    ,agg_data as(
    select cust_id,start_date
      ,max(case when city_type='physical' then city_name end) physical_city
      ,max(case when city_type='office' then city_name end) office_city
    from test
    group by cust_id,start_date
    )
    select ad.cust_id,ad.start_date
      ,lead(ad.start_date,1,cast('9999-12-31' as date))
        over(partition by ad.cust_id order by ad.start_date) end_date
      ,coalesce(ad.physical_city,rd2.city_name) physical_city
      ,coalesce(ad.office_city,rd1.city_name) office_city
    from agg_data ad
    left join ranged_data rd1 on rd1.cust_id=ad.cust_id and rd1.city_type='office'
          and  rd1.start_date<ad.start_date and rd1.to_date>=ad.start_date
    left join ranged_data rd2 on rd2.cust_id=ad.cust_id and rd2.city_type='physical'
          and  rd2.start_date<ad.start_date and rd2.to_date>=ad.start_date
    order by ad.cust_id,ad.start_date
    
    cust_id start_date end_date physical_city office_city
    1 2024-05-17 2024-09-20 Las Vegas Seattle
    1 2024-09-20 2024-10-30 Las Vegas Dallas
    1 2024-10-30 2025-02-10 Dallas Dallas
    1 2025-02-10 9999-12-31 Dallas Austin
    2 2024-05-17 2024-09-20 Las Vegas Seattle
    2 2024-09-20 2024-09-21 Las Vegas Dallas
    2 2024-09-21 2024-10-30 Las Vegas Boston
    2 2024-10-30 2025-02-10 Dallas Boston
    2 2025-02-10 9999-12-31 Dallas Austin
    3 2024-05-17 2024-09-20 Las Vegas Seattle
    3 2024-09-20 2024-09-21 Dallas Seattle
    3 2024-09-21 2024-10-30 Dallas Boston
    3 2024-10-30 2025-02-10 Seatle Boston
    3 2025-02-10 9999-12-31 Austin Boston
    4 2024-10-30 2025-02-10 null Dallas
    4 2025-02-10 9999-12-31 Austin Dallas
    5 2024-10-30 2025-02-10 Dallas null
    5 2025-02-10 9999-12-31 Dallas Austin

    fiddle