sql-servert-sqlsql-server-2017delta

Identifying changes over time


No doubt a similar question has come up before, but I haven't been able to locate it by searching...

I have a raw dataset with time series data including 'from' and 'to' date fields.

The problem is, when data is loaded, new records have been created ('to' date added to old record, new record 'from' load date) even where no values have changed. I want to convert this to a table which just shows a row for each genuine change - and the from/ to dates reflecting this.

By way of example, the source data looks like this:

ID Col1 Col2 Col3 From To
Test1 1 1 1 01/01/2020 31/12/9999
Test2 1 2 3 01/01/2020 30/06/2020
Test2 1 2 3 01/07/2020 30/09/2020
Test2 3 2 1 01/10/2020 31/12/9999

The first two records for Test2 (rows 2 and 3) are essentially the same - there was no change when the second row was loaded on 01/07/2020. I want a single row for the period 01/01/2020 - 30/09/2020 for which there was no change:

ID Col1 Col2 Col3 From To
Test1 1 1 1 01/01/2020 31/12/9999
Test2 1 2 3 01/01/2020 30/09/2020
Test2 3 2 1 01/10/2020 31/12/9999

For this simplified example, I can achieve that by grouping by each column (apart from dates) and using the MIN from date/ MAX end date:

SELECT
ID, Col1, Col2, Col3, MIN(From) AS From, MAX(To) as TO
FROM TABLE
GROUP BY ID, Col1, Col2, Col3

However, this won't work if a value changes then subsequently changes back to what it was before eg

ID Col1 Col2 Col3 From To
Test1 1 1 1 01/01/2020 31/12/9999
Test2 1 2 3 01/01/2020 30/04/2020
Test2 1 2 3 01/05/2020 30/06/2020
Test2 3 2 1 01/07/2020 30/10/2020
Test2 1 2 3 01/11/2020 31/12/9999

Simply using MIN/ MAX in the code above would return this - so it looks like both sets of values were valid for the period from 01/07/2020 - 30/10/2020:

ID Col1 Col2 Col3 From To
Test1 1 1 1 01/01/2020 31/12/9999
Test2 1 2 3 01/01/2020 31/12/9999
Test2 3 2 1 01/07/2020 30/10/2020

Whereas actually the first set of values were valid before and after that period, but not during. It should return a single row for instead of two for the period from 01/01/2020 - 30/06/2020 when there were no changes for this ID, but then another row for the period when the values were different, and then another row where it reverted to the initial values, but with a new From date.

ID Col1 Col2 Col3 From To
Test1 1 1 1 01/01/2020 31/12/9999
Test2 1 2 3 01/01/2020 30/06/2020
Test2 3 2 1 01/07/2020 30/10/2020
Test2 1 2 3 01/11/2020 31/12/9999

I'm struggling to conceptualise how to approach this. I'm guessing I need to use LAG somehow but not sure how to apply it - eg rank everything in a staging table first, then use LAG to compare a concatenation of the whole row?

I'm sure I could find a fudged way eventually, but I've no doubt this problem has been solved many times before so hoping somebody can point me to a simpler/ neater solution than I'd inevitably come up with...


Solution

  • Advanced Gaps and Islands

    I believe this is an advanced "gaps and islands" problem. Use that as a search term and you'll find plenty of literature on the subject. Only difference is normally only one column is being tracked, but you have 3.

    No Gaps Assumption

    One major assumption of this script is there is no gap in the overlapping dates, or in other words, it assumes the previous rows ToDate = current FromDate - 1 day. Not sure if you need to account for gaps, would be simple just add criteria to IsChanged to check for that

    Multi-Column Gaps and Islands Solution

    DROP TABLE IF EXISTS #Grouping
    DROP TABLE IF EXISTS #Test
    CREATE TABLE #Test (ID INT IDENTITY(1,1),TestName Varchar(10),Col1 INT,Col2 INT,Col3 INT,FromDate Date,ToDate DATE)
    
    INSERT INTO #Test VALUES
    ('Test1',1,1,1,'2020-01-01','9999-12-31')
    ,('Test2',1,2,3,'2020-01-01','2020-04-30')
    ,('Test2',1,2,3,'2020-05-01','2020-06-30')
    ,('Test2',3,2,1,'2020-07-01','2020-10-30')
    ,('Test2',1,2,3,'2020-11-01','9999-12-31')
    
    ;WITH cte_Prev AS (
        SELECT *
        ,PrevCol1 = LAG(Col1) OVER (PARTITION BY TestName       ORDER BY FromDate)
        ,PrevCol2 = LAG(Col2) OVER (PARTITION BY TestName       ORDER BY FromDate)
        ,PrevCol3 = LAG(Col3) OVER (PARTITION BY TestName       ORDER BY FromDate)
        FROM #Test
    ), cte_Compare AS (
        SELECT *
        ,IsChanged = CASE
            WHEN Col1 = PrevCol1
                AND Col2 = PrevCol2
                AND Col3 = PrevCol3
            THEN 0 /*No change*/
            ELSE 1 /*Iterate so new group created */
        END
        FROM cte_Prev
    )
    
    SELECT *,GroupID = SUM(IsChanged) OVER (PARTITION BY TestName ORDER BY ID)
    INTO #Grouping
    FROM cte_Compare
    
    /*Raw unformatted data so you can see how it works*/
    SELECT *
    FROM #Grouping
    
    /*Aggregated results*/
    SELECT GroupID,TestName,Col1,Col2,Col3
        ,FromDate = MIN(FromDate)
        ,ToDate = MAX(ToDate)
        ,NumberOfRowsCollapsedIntoOneRow = COUNT(*)
    FROM #Grouping
    GROUP BY GroupID,TestName,Col1,Col2,Col3