postgresqlstatisticsoutliers

Implementing Sn and Qn (Rousseeuw and Croux) for outlier detection in Postgres


I'm reviewing and experimenting with outlier flagging strategies, and keep running into references to Sn and Qn from Rousseeuw and Croux in Alternatives to the Median Absolute Deviation.

http://web.ipac.caltech.edu/staff/fmasci/home/astro_refs/BetterThanMAD.pdf

They sound quite excellent, and seem to be widely used in academic and applied stats across disciplines. I checked Google Scholar, and that paper has over 2,100 citations.

The appealing feature of this technique is that it isn't heavily impacted by asymmetric distributions. Which is what we've got, most of the time. Sometimes quite extremely.

This is of course available in R, but I'm not a stats person, we don't have server-side access to R (or Python), and would like to do some searches directly in Postgres. I haven't been able to find anything in any SQL idiom, and am hoping that some stats lover out there has some Postgres code up their sleeve.


Solution

  • Now I know why people do this sort of work in R: Because R is fantastic for this kind of work. If anyone comes across this in the future, go get R. It's a compact, easy-to-use, easy-to-learn language with a great IDE.

    If you've got a Postgres server where you can install PL/R, so much the better. PL/R is written to use the DBI and RPostgreSQL R packages to connect with Postgres. Meaning, you should be able to develop your code in RStudio, and then add the bits of wrapping required to make it run in PL/R within your Postgres server.

    For outliers, I'm happy with univOutl (Univariate Outliers) so far, which provides 10 common, and less common, methods, including the Rousseeuw and Croux techniques.