sql-serverlogical-operatorsrolling-computation

Apply expanding calculations on SQL Server


I have this table in SQL server :

date var val
2022-2-1 A 1.1
2022-3-1 A 2.3
2022-4-1 A 1.5
2022-5-1 A 1.7
2022-09-1 B 1.8
2022-10-1 B 1.9
2022-11-1 B 2.1
2022-12-1 B 2.22

I want to group by column var and date and implement the expanding average and standard deviation in order to create an upper and lower interval of one standard deviation. At the end I want each entry in column val (apart from the first one) to be checked if falls inside the interval of the previous time (or lag 1).

How can I do it in SQL Server?

My attempt

-- Create the table
CREATE TABLE DataTable (
    [date] DATE,
    var CHAR(1),
    val DECIMAL(4, 2)
);

-- Insert the data
INSERT INTO DataTable ([date], var, val)
VALUES 
('2022-02-01', 'A', 1.1),
('2022-03-01', 'A', 2.3),
('2022-04-01', 'A', 1.5),
('2022-05-01', 'A', 1.7),
('2022-09-01', 'B', 1.8),
('2022-10-01', 'B', 1.9),
('2022-11-01', 'B', 2.1),
('2022-12-01', 'B', 2.22);


WITH DataWithStats AS (
    SELECT 
        [date],
        var,
        val,
        AVG(val) OVER (PARTITION BY var ORDER BY [date] 
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_avg,
        STDEV(val) OVER (PARTITION BY var ORDER BY [date] 
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_std
    FROM DataTable
)
SELECT 
    [date],
    var,
    exp_avg - 2 * exp_std AS lower,
    val,
    exp_avg + 2 * exp_std AS upper,
    CASE 
        WHEN val < LAG(exp_avg - 2 * exp_std, 1) OVER (PARTITION BY var ORDER BY [date]) OR  
             val > LAG(exp_avg + 2 * exp_std, 1) OVER (PARTITION BY var ORDER BY [date]) 
        THEN 'Warning'
        ELSE 'ok'
    END AS status
FROM DataWithStats;

The problem with my attempt is that the check column does not properly evaluate the check.

https://sqlfiddle.com/sql-server/online-compiler?id=fe2e764f-9627-418f-bd08-dd13250b0dab


Solution

  • If you believe the "warning" status is incorrect in the query results, because 2.10 is not outside the displayed 1.62782795366962 - 2.23883804633038 range, then I believe the issue that your displayed values are not the ones used to calculate the status. The LAG() average and standard deviation values are used instead of the current values. The 2.10 value is outside the actual 1.7085786437627 - 1.9914213562373 range used by the test.

    The results were correct (as coded), but the presentation was misleading.

    Original results:

    date var lower val upper status
    2022-02-01 A null 1.10 null ok
    2022-03-01 A 0.00294372515228614 2.30 3.39705627484771 ok
    2022-04-01 A 0.411312814678444 1.50 2.85535318532156 ok
    2022-05-01 A 0.650000000000001 1.70 2.65 ok
    2022-09-01 B null 1.80 null ok
    2022-10-01 B 1.7085786437627 1.90 1.9914213562373 ok
    2022-11-01 B 1.62782795366962 2.10 2.23883804633038 Warning
    2022-12-01 B 1.62500000000001 2.22 2.38499999999999 ok

    Results showing actual range used by status calculation.

    date var lower val upper status
    2022-02-01 A null 1.10 null ok
    2022-03-01 A null 2.30 null ok
    2022-04-01 A 0.00294372515228614 1.50 3.39705627484771 ok
    2022-05-01 A 0.411312814678444 1.70 2.85535318532156 ok
    2022-09-01 B null 1.80 null ok
    2022-10-01 B null 1.90 null ok
    2022-11-01 B 1.7085786437627 2.10 1.9914213562373 Warning
    2022-12-01 B 1.62782795366962 2.22 2.23883804633038 ok

    Note that STDEV() will be null and no range will be calculated until there are at least two prior values.

    The following query modifies the window range in the CTE, so that LAG() is not needed in the final query. The displayed range limit values are the same as are used in the test.

    WITH DataWithStats AS (
        SELECT 
            [date],
            var,
            val,
            AVG(val) OVER (PARTITION BY var ORDER BY [date] 
                           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS exp_avg,
            STDEV(val) OVER (PARTITION BY var ORDER BY [date] 
                              ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS exp_std
        FROM DataTable
    )
    SELECT 
        [date],
        var,
        exp_avg - 2 * exp_std AS lower,
        val,
        exp_avg + 2 * exp_std AS upper,
        CASE 
            WHEN val < exp_avg - 2 * exp_std OR  
                 val > exp_avg + 2 * exp_std
            THEN 'Warning'
            ELSE 'ok'
        END AS status
    FROM DataWithStats;
    

    See this db<>fiddle.