I have this table in SQL server :
date | var | val |
---|---|---|
2022-2-1 | A | 1.1 |
2022-3-1 | A | 2.3 |
2022-4-1 | A | 1.5 |
2022-5-1 | A | 1.7 |
2022-09-1 | B | 1.8 |
2022-10-1 | B | 1.9 |
2022-11-1 | B | 2.1 |
2022-12-1 | B | 2.22 |
I want to group by column var and date and implement the expanding average and standard deviation in order to create an upper and lower interval of one standard deviation. At the end I want each entry in column val (apart from the first one) to be checked if falls inside the interval of the previous time (or lag 1).
How can I do it in SQL Server?
My attempt
-- Create the table
CREATE TABLE DataTable (
[date] DATE,
var CHAR(1),
val DECIMAL(4, 2)
);
-- Insert the data
INSERT INTO DataTable ([date], var, val)
VALUES
('2022-02-01', 'A', 1.1),
('2022-03-01', 'A', 2.3),
('2022-04-01', 'A', 1.5),
('2022-05-01', 'A', 1.7),
('2022-09-01', 'B', 1.8),
('2022-10-01', 'B', 1.9),
('2022-11-01', 'B', 2.1),
('2022-12-01', 'B', 2.22);
WITH DataWithStats AS (
SELECT
[date],
var,
val,
AVG(val) OVER (PARTITION BY var ORDER BY [date]
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_avg,
STDEV(val) OVER (PARTITION BY var ORDER BY [date]
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_std
FROM DataTable
)
SELECT
[date],
var,
exp_avg - 2 * exp_std AS lower,
val,
exp_avg + 2 * exp_std AS upper,
CASE
WHEN val < LAG(exp_avg - 2 * exp_std, 1) OVER (PARTITION BY var ORDER BY [date]) OR
val > LAG(exp_avg + 2 * exp_std, 1) OVER (PARTITION BY var ORDER BY [date])
THEN 'Warning'
ELSE 'ok'
END AS status
FROM DataWithStats;
The problem with my attempt is that the check column does not properly evaluate the check.
https://sqlfiddle.com/sql-server/online-compiler?id=fe2e764f-9627-418f-bd08-dd13250b0dab
If you believe the "warning" status is incorrect in the query results, because 2.10 is not outside the displayed 1.62782795366962 - 2.23883804633038 range, then I believe the issue that your displayed values are not the ones used to calculate the status. The LAG() average and standard deviation values are used instead of the current values. The 2.10 value is outside the actual 1.7085786437627 - 1.9914213562373 range used by the test.
The results were correct (as coded), but the presentation was misleading.
Original results:
date | var | lower | val | upper | status |
---|---|---|---|---|---|
2022-02-01 | A | null | 1.10 | null | ok |
2022-03-01 | A | 0.00294372515228614 | 2.30 | 3.39705627484771 | ok |
2022-04-01 | A | 0.411312814678444 | 1.50 | 2.85535318532156 | ok |
2022-05-01 | A | 0.650000000000001 | 1.70 | 2.65 | ok |
2022-09-01 | B | null | 1.80 | null | ok |
2022-10-01 | B | 1.7085786437627 | 1.90 | 1.9914213562373 | ok |
2022-11-01 | B | 1.62782795366962 | 2.10 | 2.23883804633038 | Warning |
2022-12-01 | B | 1.62500000000001 | 2.22 | 2.38499999999999 | ok |
Results showing actual range used by status calculation.
date | var | lower | val | upper | status |
---|---|---|---|---|---|
2022-02-01 | A | null | 1.10 | null | ok |
2022-03-01 | A | null | 2.30 | null | ok |
2022-04-01 | A | 0.00294372515228614 | 1.50 | 3.39705627484771 | ok |
2022-05-01 | A | 0.411312814678444 | 1.70 | 2.85535318532156 | ok |
2022-09-01 | B | null | 1.80 | null | ok |
2022-10-01 | B | null | 1.90 | null | ok |
2022-11-01 | B | 1.7085786437627 | 2.10 | 1.9914213562373 | Warning |
2022-12-01 | B | 1.62782795366962 | 2.22 | 2.23883804633038 | ok |
Note that STDEV()
will be null and no range will be calculated until there are at least two prior values.
The following query modifies the window range in the CTE, so that LAG()
is not needed in the final query. The displayed range limit values are the same as are used in the test.
WITH DataWithStats AS (
SELECT
[date],
var,
val,
AVG(val) OVER (PARTITION BY var ORDER BY [date]
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS exp_avg,
STDEV(val) OVER (PARTITION BY var ORDER BY [date]
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS exp_std
FROM DataTable
)
SELECT
[date],
var,
exp_avg - 2 * exp_std AS lower,
val,
exp_avg + 2 * exp_std AS upper,
CASE
WHEN val < exp_avg - 2 * exp_std OR
val > exp_avg + 2 * exp_std
THEN 'Warning'
ELSE 'ok'
END AS status
FROM DataWithStats;
See this db<>fiddle.