The business requirement is to collapse consecutive periods into 1 row using SQL in Databricks. If a row's end date is one day before the next rows start date, then these 2 rows are considered to be the same period.
Here is a demo data set:
row | id | start_date | end_date |
---|---|---|---|
1 | a | 2024-05-03 | 2024-05-06 |
2 | a | 2024-05-07 | 2024-05-10 |
3 | a | 2024-05-13 | 2024-05-17 |
4 | a | 2024-05-18 | 2024-05-22 |
5 | a | 2024-05-23 | 2024-06-27 |
In this data set, row 1 and 2 should be of the same period, and so do row 3 to 5.
The desired output is:
row | id | start_date | end_date |
---|---|---|---|
1 | a | 2024-05-03 | 2024-05-10 |
2 | a | 2024-05-13 | 2024-06-27 |
I tried to use the lead and lag window functions to get previous rows' end date but this approach doesn't work if you have more than 2 rows of consecutive periods.
Thanks very much for your help
This is typically known as Gaps and Island problem. You need to create a dummy column to identify your continuous date range and then find the min and max values based on that column -
SELECT id, MIN(start_date) start_date, MAX(end_date) end_date
FROM (SELECT id, start_date, end_date, SUM(rn) OVER(PARTITION BY id ORDER BY start_date) sum_flag
FROM (SELECT id, start_date, end_date,
CASE WHEN start_date - 1 <> LAG(end_date) OVER(PARTITION BY id ORDER BY start_date)
THEN 1
ELSE 0
END rn
FROM demo
) T
) X
GROUP BY id, sum_flag;