sqlsql-servert-sqlssmsstdev

How to remove outliers in SQL using Standard Deviation?


I currently have a dataset that contains transactional information, including dates, an amount field, and descriptive fields. I am looking to do simple statistical analysis (mean, median, etc.), but need to remove outliers first. The issue I'm having is removing the outliers while grouping the data by Region, City, and Date. I'm sure this could be done more easily using a robust programming language (R, Python, etc.), but I have to use SQL (SSMS) for this task.

Simple dataset example:

| Region | City    | Date       | Amount |
| ------ | ------- | ---------- | ------ |
| SW     | Phoenix | 2021-10-01 | 400    |
| NE     | Boston  | 2021-10-03 | 20     |
| SW     | Phoenix | 2021-10-03 | 800    |
| SW     | Phoenix | 2021-10-02 | 425    |
| NE     | Boston  | 2021-10-01 | 500    |
| SW     | Phoenix | 2021-10-02 | 15     |
| SW     | Phoenix | 2021-10-04 | 100    |
| NE     | Boston  | 2021-10-04 | 35     |
| SE     | Orlando | 2021-10-02 | 300    |

Initially I wanted to use IQR method, but I'm thinking using standard deviation (STDEV) will be easier. However, I'm running into issues with the grouping once I include the STDEV code.

Here is the current code as I have it:

WITH CTE_data AS (
SELECT 
     Region
    ,City
    ,Date
    ,Amount
FROM OrderTable
)

SELECT 
     Region
    ,City
    ,MAX(Date) AS MaxDate
    ,MIN(Date) AS MinDate
    ,AVG(Amount) AS AvgAmt
    ,STDEV(Amount) AS StedvAmt
FROM CTE_data
GROUP BY Region, City 

Given the sample dataset, I'd like to ignore the amounts that fall outside of ((Amount > (AvgAmt - StedvAmt) AND Amount < (AvgAmt + StedvAmt)). My end goal is to have an average amount for each Region/City, as well as an average amount (excluding the outliers). My desired output would be:

| Region | City    | MinDate    | MaxDate    | AvgAmt | AvgAmt_rem |
| ------ | ------- | ---------- | ---------- | ------ | ---------- |
| SW     | Phoenix | 2021-10-01 | 2021-10-04 | 348    | 308.33     |
| NE     | Boston  | 2021-10-01 | 2021-10-04 | 185    | 27.5       |
| SE     | Orlando | 2021-10-02 | 2021-10-02 | 300    | 300        |

Please note: for simplicity I didn't do the actual standard deviation calculation for my desired output, I just removed the outliers based on looking at the values (800 & 15 for Phoenix and 500 for Boston).


Solution

  • It's not necessary to use a self-join. You can do this with a single scane of the base table, using window functions

    WITH cteStats as ( --Calculate the Avg & Std of the raw data
        SELECT 
             *
            ,AVG(Amount) OVER (PARTITION BY Region, City) AS AvgRaw
            ,STDEV(Amount) OVER (PARTITION BY Region, City) AS StedRaw
        FROM CTE_data
    )
    SELECT 
             S.Region
            ,S.City
            ,MIN(S.ReadingDate) AS MinDate
            ,MAX(S.ReadingDate) AS MaxDate
            ,AVG(S.Amount) AS AvgFiltered
            ,STDEV(S.Amount) AS StedvFiltered
    FROM cteStats as S
    WHERE s.Amount BETWEEN S.AvgRaw - S.StedRaw AND S.AvgRaw + S.StedRaw 
    GROUP BY s.Region, S.City;
    

    db<>fiddle

    You can also show both with and without the filtered rows, by using conditional aggregation

    WITH cteStats as ( --Calculate the Avg & Std of the raw data
        SELECT 
             *
            ,AVG(Amount) OVER (PARTITION BY Region, City) AS AvgRaw
            ,STDEV(Amount) OVER (PARTITION BY Region, City) AS StedRaw
        FROM CTE_data
    )
    SELECT 
             S.Region
            ,S.City
            ,MIN(CASE WHEN s.Amount BETWEEN S.AvgRaw - S.StedRaw AND S.AvgRaw + S.StedRaw 
                   THEN S.ReadingDate END) AS MinDate
            ,MAX(CASE WHEN s.Amount BETWEEN S.AvgRaw - S.StedRaw AND S.AvgRaw + S.StedRaw 
                   THEN S.ReadingDate END) AS MaxDate
            ,AVG(CASE WHEN s.Amount BETWEEN S.AvgRaw - S.StedRaw AND S.AvgRaw + S.StedRaw 
                   THEN S.Amount END) AS AvgFiltered
            ,AVG(S.Amount) AS AvgAll
    FROM cteStats as S
    GROUP BY s.Region, S.City;
    

    db<>fiddle