sqlsql-servercommon-table-expressiongaps-and-islands

Group by columns when date ranges are contiguous


I have data in a format like this:

DECLARE @WidgetPrice TABLE (WidgetPriceId BIGINT IDENTITY(1,1), WidgitId INT, Price MONEY, 
    StartEffectiveWhen DATE, EndEffectiveWhen DATE)

INSERT INTO @WidgetPrice(WidgitId, Price, StartEffectiveWhen, EndEffectiveWhen)
VALUES
(100,      21.48, '2020-1-1',         '2021-8-5'),
(100,      19.34, '2021-8-6',         '2021-12-31'),
(100,      19.34, '2022-1-1',         '2022-12-31'),
(100,      19.34, '2023-1-1',         '2023-1-31'),
-- There is a date gap here (No price from 2023-1-31 to 2023-3-5)
(100,      19.34, '2023-3-5',         '2023-12-31'),
(100,      12.87, '2024-1-1',         '2024-1-31'),
(100,      12.87, '2024-2-1',         '2100-12-31'),
-- Next Widget's prices          
(200,      728.25, '2020-1-1',         '2021-12-31'),
(200,      728.25, '2022-1-1',         '2022-12-31'),
(200,      861.58, '2023-1-1',         '2024-5-21'),
(200,      601.19, '2024-5-22',        '2100-12-31')

I need to group by the WidgetId and Price but only when the dates are contiguous.

So, in my example data, there is a gap in the data between 2023-1-31 and 2023-3-5. Because there is a gap there, I need to have two entries for the price 19.34.

Here is an image of the data as I would hope to get it:

Desired Data

The key rows in this output are rows 2 and 3. It has the same price listed twice because there is a gap in the dates.

I had thought to make a recursive CTE that could look at the LAG value for the StartEffeciveWhen and EndEffectiveWhen, but I could not figure it out.

Any ideas on how I can build a query that does this?

NOTE: My actual data is over 113,000,000 rows with a lot more columns. I have just presented a simplified version for this question.

NOTE 2: I am running Microsoft SQL Server 2017


Solution

  • SQLFiddle: I put together a SQLFiddle so that you can run this query and adjust it to see different results. Very useful site: SQL Fiddle With Answer

    Strategy: Instead of focusing on chaining multiple rows togethers, you can focus on finding the edges. Then you can group by the running total of starting edges. I'm sure you could shorten this, but this gets the idea across:

    1. Use Lag Windowed Function on Price and Date to find changes.
    2. Use Case expression to define edges.
    3. Group by Sum Windowed function (Running total) of Starts to combine for min and max dates.
    --Find interesting edges
    SELECT 
        WidgitId
        , Price
        , MIN(StartEffectiveWhen) AS StartEffectiveWhen
        , MAX(EndEffectiveWhen) AS EndEffectiveWhen
    FROM (
        SELECT 
            SUM(i.StartingEdge) OVER (
                PARTITION BY WidgitId ORDER BY StartEffectiveWhen
                ) AS LeadingCount
            , WidgitId
            , Price
            , StartEffectiveWhen
            , EndEffectiveWhen
        FROM (
            SELECT 
                CASE 
                    WHEN LagPrice != Price
                        OR LagDate != DATEADD(day, - 1, StartEffectiveWhen)
                        THEN 1
                    ELSE 0
                    END AS StartingEdge
                , WidgitId
                , Price
                , StartEffectiveWhen
                , EndEffectiveWhen
            FROM (
                SELECT 
                    LAG(price) OVER (
                        PARTITION BY WidgitID ORDER BY StartEffectiveWhen ASC
                        ) AS LagPrice
                    , LAG(EndEffectiveWhen) OVER (
                        PARTITION BY WidgitID ORDER BY StartEffectiveWhen ASC
                        ) AS LagDate
                    , *
                FROM @WidgetPrice
            ) p
        ) i
    ) g
    GROUP BY g.Price,g.WidgitID,LeadingCount
    

    Output from Query:

    WidgitId Price StartEffectiveWhen EndEffectiveWhen
    100 12.87 1/1/2024 12/31/2100
    100 19.34 8/6/2021 1/31/2023
    100 19.34 3/5/2023 12/31/2023
    100 21.48 1/1/2020 8/5/2021
    200 601.19 5/22/2024 12/31/2100
    200 728.25 1/1/2020 12/31/2022
    200 861.58 1/1/2023 5/21/2024

    Alternative sub-query for those who might want to reduce the number of subqueries:

    --Find interesting edges
    SELECT WidgitId
        ,Price
        ,MIN(StartEffectiveWhen) AS StartEffectiveWhen
        ,MAX(EndEffectiveWhen) AS EndEffectiveWhen
    FROM (
        SELECT 
            SUM(i.StartingEdge) OVER (
                PARTITION BY WidgitId ORDER BY StartEffectiveWhen
                ) AS LeadingCount
            , WidgitId
            , Price
            , StartEffectiveWhen
            , EndEffectiveWhen
        FROM (
            SELECT 
                CASE 
                    WHEN LAG(price) OVER (
                                PARTITION BY WidgitID ORDER BY StartEffectiveWhen ASC
                                ) != Price
                        OR LAG(EndEffectiveWhen) OVER (
                                PARTITION BY WidgitID ORDER BY StartEffectiveWhen ASC
                                ) != DATEADD(day, - 1, StartEffectiveWhen)
                        THEN 1
                    ELSE 0
                    END AS StartingEdge
                , WidgitId
                , Price
                , StartEffectiveWhen
                , EndEffectiveWhen
            FROM @WidgetPrice
        ) i
    ) g
    GROUP BY g.Price, g.WidgitID, LeadingCount
    
    

    And here's a SQL Fiddle for the query with 1 less subquery