sqlsql-serverwindow-functionschange-data-capture

How to Handle Rows with and without Status Changes in History Table


I have a table in SQL Server that tracks the status of different IDs over time. The table includes both historical data and current data, and I need to write a query that will return the correct status and date ranges for each ID, whether or not the status has changed.

The table structure is as follows:

CREATE TABLE #history_Card_status (
    HistoryData BIT,
    StatusUpdateDate DATETIME,
    CreateDate DATETIME,
    Id BIGINT,
    Status NVARCHAR(50)
);

INSERT INTO #history_Card_status (HistoryData, StatusUpdateDate, CreateDate, Id, Status)
VALUES
-- Current Data
(0, NULL, '2023-06-13 05:07:38.700', 222, 'Open'),
(0, NULL, '2021-07-16 00:44:46.740', 111, 'Closed'),
-- Historical Data
(1, '2024-08-20 21:10:57.093', '2021-07-16 00:44:46.740', 111, 'Closed'),
(1, '2024-07-05 13:22:04.220', '2021-07-16 00:44:46.740', 111, 'Closed'),
(1, '2024-07-05 13:13:02.133', '2021-07-16 00:44:46.740', 111, 'Inactive/Block'),
(1, '2024-07-02 03:01:12.467', '2021-07-16 00:44:46.740', 111, 'Inactive/Block'),
(1, '2024-06-24 02:12:00.773', '2021-07-16 00:44:46.740', 111, 'Inactive/Block'),
(1, '2024-06-24 02:12:00.687', '2021-07-16 00:44:46.740', 111, 'Inactive/Block'),
(1, '2024-06-24 02:00:46.040', '2021-07-16 00:44:46.740', 111, 'Open'),
(1, '2024-06-24 02:00:44.303', '2021-07-16 00:44:46.740', 111, 'Open'),
(1, '2024-04-14 11:57:21.133', '2021-07-16 00:44:46.740', 111, 'Open'),
(1, '2024-04-14 11:52:09.073', '2021-07-16 00:44:46.740', 111, 'Open'),
(1, '2024-07-01 04:28:08.213', '2023-06-13 05:07:38.700', 222, 'Open'),
(1, '2024-07-01 03:39:54.607', '2023-06-13 05:07:38.700', 222, 'Open'),
(1, '2024-04-24 11:18:59.380', '2023-06-13 05:07:38.700', 222, 'Open'),
(1, '2024-04-24 11:18:59.227', '2023-06-13 05:07:38.700', 222, 'Open');

This table contains both historical records (marked with HistoryData = 1) and current records (marked with HistoryData = 0). The StatusUpdateDate column is filled only for historical records, while the CreateDate column is present in all records.

Problem

I need to create a query that returns the status and the corresponding date ranges for each ID. The query should handle both cases where the status changes and cases where it does not. The expected result should look like this:

StatusUpdateDate         StatusTillDate          Id      Status
-----------------------  ----------------------  ------  -----------
2024-07-05 13:13:02.133  2024-08-24 08:49:31.233  111     Closed
2024-06-24 02:00:46.040  2024-07-05 13:13:02.133  111     Inactive/Block
2021-07-16 00:44:46.740  2024-06-24 02:00:46.040  111     Open
2023-06-13 05:07:38.700  2024-08-24 08:49:31.233  222     Open

My Attempt

I started with a query that works for cases where status changes occur:

WITH Step1 AS (
    SELECT 
        StatusUpdateDate = ISNULL(StatusUpdateDate, ISNULL(history.CreateDate, '1970-01-01 00:00:00.000')),
        Id,
        Status,
        lag_status = LAG(Status) OVER (PARTITION BY Id ORDER BY StatusUpdateDate DESC), 
        ChangeFlag = CASE 
            WHEN Status <> LAG(Status) OVER (PARTITION BY Id ORDER BY StatusUpdateDate DESC) 
            THEN 1 ELSE 0 END 
    FROM #history_Card_status AS history
)
SELECT 
    StatusUpdateDate,
    StatusTillDate = ISNULL(LAG(StatusUpdateDate) OVER (PARTITION BY Id ORDER BY StatusUpdateDate DESC), DATEADD(DAY, 2, GETDATE())),
    Id,
    Status = ISNULL(LAG(Status) OVER (PARTITION BY Id ORDER BY StatusUpdateDate DESC), lag_status)
FROM Step1
WHERE ChangeFlag = 1;

This query works well for IDs where a status change has occurred, but I only see IDs with changes, like this:

StatusUpdateDate         StatusTillDate          Id      Status
-----------------------  ----------------------  ------  -----------
2024-07-05 13:13:02.133  2024-08-24 06:22:39.003  111     Closed
2024-06-24 02:00:46.040  2024-07-05 13:13:02.133  111     Inactive/Block
2021-07-16 00:44:46.740  2024-06-24 02:00:46.040  111     Open

Then, I tried to handle the rows where changes did not take place:

WITH Step1 AS (
    SELECT 
        StatusUpdateDate = ISNULL(StatusUpdateDate, ISNULL(history.CreateDate, '1970-01-01 00:00:00.000')),
        Id,
        Status,
        lag_status = LAG(Status) OVER (PARTITION BY Id ORDER BY StatusUpdateDate DESC), 
        ChangeFlag = CASE 
            WHEN Status <> LAG(Status) OVER (PARTITION BY Id ORDER BY StatusUpdateDate DESC) 
            THEN 1 ELSE 0 END,
        rownum = ROW_NUMBER() OVER (PARTITION BY Id ORDER BY StatusUpdateDate DESC) 
    FROM #history_Card_status AS history
)
SELECT 
    StatusUpdateDate,
    StatusTillDate = ISNULL(LAG(StatusUpdateDate) OVER (PARTITION BY Id ORDER BY StatusUpdateDate DESC), DATEADD(DAY, 2, GETDATE())),
    Id,
    ChangeFlag,
    Status = ISNULL(LAG(Status) OVER (PARTITION BY Id ORDER BY StatusUpdateDate DESC), lag_status),
    rownum
FROM Step1
WHERE ChangeFlag = 1
OR (ChangeFlag = 0 AND rownum = 1); 

This attempt, however, returns disappointing results like:

StatusUpdateDate         StatusTillDate          Id      ChangeFlag  Status
-----------------------  ----------------------  ------  ----------- -----------
2024-08-20 21:10:57.093  2024-08-24 16:44:22.340  111     0           NULL
2024-07-05 13:13:02.133  2024-08-20 21:10:57.093  111     1           Closed
2024-06-24 02:00:46.040  2024-07-05 13:13:02.133  111     1           Inactive/Block
2021-07-16 00:44:46.740  2024-06-24 02:00:46.040  111     1           Open
2024-07-01 04:28:08.213  2024-08-24 16:44:22.340  222     0           NULL

What I'm Looking For

I'm looking for a way to return the correct status and date ranges for each ID, including those without any status changes, as shown in the expected results.


Solution

  • This looks like a standard gaps and island, with some small wrinkles, like the start date:

    SELECT
        MIN(CASE WHEN groupflag = 0 THEN CreateDate ELSE StatusUpdateDate END) AS minDate
        , MAX(ISNULL(StatusUpdateDate, DATEADD(DAY, 2, GETDATE()))) AS maxDate
        , ID, status, groupFlag
    FROM (
        SELECT *
            , SUM(flag) OVER(partition BY ID ORDER BY historydata DESC, statusupdatedate rows BETWEEN unbounded preceding AND CURRENT row) AS groupFlag
        FROM (
            SELECT
                CASE WHEN status <> lag(status) OVER(partition BY id ORDER BY historydata DESC, statusupdatedate) THEN 1 ELSE 0 END AS flag
                , *
            FROM #history_Card_status h
        ) x
    ) x
    GROUP BY groupflag, id, status
    ORDER BY id, groupflag DESC
    

    I usually use the CASE WHEN something <> lag(something) then 1 else 0 end which takes cares of the first row.

    Then, we summarize the above flag to create the grouping. For performance I use ROWS BETWEEN, otherwise it defaults to RANGE which works a little slower.

    Finally, I take the CreateDate for the first row to create the anchor date, and StatusUpdateDate for the others.

    Output:

    minDate maxDate ID status groupFlag
    2024-07-05 13:22:04.220 2024-08-25 00:04:20.760 111 Closed 2
    2024-06-24 02:12:00.687 2024-07-05 13:13:02.133 111 Inactive/Block 1
    2021-07-16 00:44:46.740 2024-06-24 02:00:46.040 111 Open 0
    2023-06-13 05:07:38.700 2024-08-25 00:04:20.760 222 Open 0