mysqlsqlstdev

MySQL Ignoring Outliers


I have to present some data to work colleagues and i am having issues analysing it in MySQL.

I have 1 table called 'payments'. Each payment has columns for:

  1. Client (our client e.g. a bank)
  2. Amount_gbp (the GBP equivalent of the value of the transaction)
  3. Currency
  4. Origin_country
  5. Client_type (individual or company)

I have written pretty simple queries like:

SELECT  
    AVG(amount_GBP), 
    COUNT(client) AS '#Of Results'
FROM payments

WHERE client_type = 'individual'
    AND amount_gbp IS NOT NULL
    AND currency = 'TRY'
    AND country_origin = 'GB'
    AND date_time BETWEEN '2017/1/1' AND '2017/9/1'

But what i really need to do is eliminate outliers from the average AND/OR only include results within a number of Standard Deviations from the Mean.

For example, ignore the top/bottom 10 results of 2% of results etc. AND/OR ignore any results that fall outside of 2 STDEVs from the Mean

Can anyone help?


Solution

  • --- EDITED ANSWER -- TRY AND LET ME KNOW ---

    Your best best is to create a TEMPORARY table with the avg and std_dev values and compare against them. Let me know if that is not feasible:

    CREATE TEMPORARY TABLE payment_stats AS
     SELECT
      AVG(p.amount_gbp) as avg_gbp,
      STDDEV(amount_gbp) as std_gbp,
      (SELECT MIN(srt.amount_gbp) as max_gbp
        FROM (SELECT amount_gbp
         FROM payments
         <... repeat where no p. ...>
         ORDER BY amount_gbp DESC
         LIMIT <top_numbers to ignore>
       ) srt
      ) max_g,
      (SELECT MAX(srt.amount_gbp) as min_gbp
        FROM (SELECT amount_gbp
         FROM payments
         <... repeat where no p. ...>
         ORDER BY amount_gbp ASC
         LIMIT <top_numbers to ignore>
       ) srt
      ) min_g
     FROM payments
     WHERE client_type = 'individual'
      AND amount_gbp IS NOT NULL
      AND currency = 'TRY'
      AND country_origin = 'GB'
      AND date_time BETWEEN '2017/1/1' AND '2017/9/1';
    

    You can then compare against the temp table

    SELECT  
     AVG(p.amount_gbp) as avg_gbp, 
     COUNT(p.client) AS '#Of Results'
    FROM payments p
    WHERE
     p.amount_gbp >= (SELECT (avg_gbp - std_gbp*2) 
                    FROM payment_stats)
     AND p.amount_gbp <= (SELECT (avg_gbp + std_gbp*2) 
                    FROM payment_stats)
     AND p.amount_gbp > (SELECT min_g FROM payment_stats)
     AND p.amount_gbp < (SELECT max_g FROM payment_stats)
     AND p.client_type = 'individual'
     AND p.amount_gbp IS NOT NULL
     AND p.currency = 'TRY'
     AND p.country_origin = 'GB'
     AND p.date_time BETWEEN '2017/1/1' AND '2017/9/1';
    

    -- Later on

    DROP TEMPORARY TABLE payment_stats;
    

    Notice I had to repeat the WHERE condition. Also change *2 to whatever <factor> to what you need!

    Still Phew!

    Each compare will check a different stat

    Let me know if this is better