I have data in a format like this:
DECLARE @WidgetPrice TABLE (WidgetPriceId BIGINT IDENTITY(1,1), WidgitId INT, Price MONEY,
StartEffectiveWhen DATE, EndEffectiveWhen DATE)
INSERT INTO @WidgetPrice(WidgitId, Price, StartEffectiveWhen, EndEffectiveWhen)
VALUES
(100, 21.48, '2020-1-1', '2021-8-5'),
(100, 19.34, '2021-8-6', '2021-12-31'),
(100, 19.34, '2022-1-1', '2022-12-31'),
(100, 19.34, '2023-1-1', '2023-1-31'),
-- There is a date gap here (No price from 2023-1-31 to 2023-3-5)
(100, 19.34, '2023-3-5', '2023-12-31'),
(100, 12.87, '2024-1-1', '2024-1-31'),
(100, 12.87, '2024-2-1', '2100-12-31'),
-- Next Widget's prices
(200, 728.25, '2020-1-1', '2021-12-31'),
(200, 728.25, '2022-1-1', '2022-12-31'),
(200, 861.58, '2023-1-1', '2024-5-21'),
(200, 601.19, '2024-5-22', '2100-12-31')
I need to group by the WidgetId
and Price
but only when the dates are contiguous.
So, in my example data, there is a gap in the data between 2023-1-31 and 2023-3-5. Because there is a gap there, I need to have two entries for the price 19.34.
Here is an image of the data as I would hope to get it:
The key rows in this output are rows 2 and 3. It has the same price listed twice because there is a gap in the dates.
I had thought to make a recursive CTE that could look at the LAG
value for the StartEffeciveWhen
and EndEffectiveWhen
, but I could not figure it out.
Any ideas on how I can build a query that does this?
NOTE: My actual data is over 113,000,000 rows with a lot more columns. I have just presented a simplified version for this question.
NOTE 2: I am running Microsoft SQL Server 2017
SQLFiddle: I put together a SQLFiddle so that you can run this query and adjust it to see different results. Very useful site: SQL Fiddle With Answer
Strategy: Instead of focusing on chaining multiple rows togethers, you can focus on finding the edges. Then you can group by the running total of starting edges. I'm sure you could shorten this, but this gets the idea across:
--Find interesting edges
SELECT
WidgitId
, Price
, MIN(StartEffectiveWhen) AS StartEffectiveWhen
, MAX(EndEffectiveWhen) AS EndEffectiveWhen
FROM (
SELECT
SUM(i.StartingEdge) OVER (
PARTITION BY WidgitId ORDER BY StartEffectiveWhen
) AS LeadingCount
, WidgitId
, Price
, StartEffectiveWhen
, EndEffectiveWhen
FROM (
SELECT
CASE
WHEN LagPrice != Price
OR LagDate != DATEADD(day, - 1, StartEffectiveWhen)
THEN 1
ELSE 0
END AS StartingEdge
, WidgitId
, Price
, StartEffectiveWhen
, EndEffectiveWhen
FROM (
SELECT
LAG(price) OVER (
PARTITION BY WidgitID ORDER BY StartEffectiveWhen ASC
) AS LagPrice
, LAG(EndEffectiveWhen) OVER (
PARTITION BY WidgitID ORDER BY StartEffectiveWhen ASC
) AS LagDate
, *
FROM @WidgetPrice
) p
) i
) g
GROUP BY g.Price,g.WidgitID,LeadingCount
Output from Query:
WidgitId | Price | StartEffectiveWhen | EndEffectiveWhen |
---|---|---|---|
100 | 12.87 | 1/1/2024 | 12/31/2100 |
100 | 19.34 | 8/6/2021 | 1/31/2023 |
100 | 19.34 | 3/5/2023 | 12/31/2023 |
100 | 21.48 | 1/1/2020 | 8/5/2021 |
200 | 601.19 | 5/22/2024 | 12/31/2100 |
200 | 728.25 | 1/1/2020 | 12/31/2022 |
200 | 861.58 | 1/1/2023 | 5/21/2024 |
Alternative sub-query for those who might want to reduce the number of subqueries:
--Find interesting edges
SELECT WidgitId
,Price
,MIN(StartEffectiveWhen) AS StartEffectiveWhen
,MAX(EndEffectiveWhen) AS EndEffectiveWhen
FROM (
SELECT
SUM(i.StartingEdge) OVER (
PARTITION BY WidgitId ORDER BY StartEffectiveWhen
) AS LeadingCount
, WidgitId
, Price
, StartEffectiveWhen
, EndEffectiveWhen
FROM (
SELECT
CASE
WHEN LAG(price) OVER (
PARTITION BY WidgitID ORDER BY StartEffectiveWhen ASC
) != Price
OR LAG(EndEffectiveWhen) OVER (
PARTITION BY WidgitID ORDER BY StartEffectiveWhen ASC
) != DATEADD(day, - 1, StartEffectiveWhen)
THEN 1
ELSE 0
END AS StartingEdge
, WidgitId
, Price
, StartEffectiveWhen
, EndEffectiveWhen
FROM @WidgetPrice
) i
) g
GROUP BY g.Price, g.WidgitID, LeadingCount
And here's a SQL Fiddle for the query with 1 less subquery