
How do I easily find boxplot outliers?


Below is an example using the mtcars dataset. There is one outlier with a value of 33.9, but I want a function that finds all of them for a given column.

library(dplyr)
library(ggplot2)

mtcars %>%
  ggplot(aes(x = "", y = mpg)) +
  geom_boxplot(fill = "#2645df")

I do not know the formula for boxplot whisker limits, so I read approximate cut-offs off the plot above and hard-coded them:

# both cut-offs in one call, so the second test does not overwrite the first
res = ifelse(mtcars$mpg > 33 | mtcars$mpg < 10, "outlier", "not outlier")

This approach is both inefficient and incorrect: 33 is not the upper limit for outliers, nor is 10 the lower limit.
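
Rather than reading the limits off the plot by eye, the statistics ggplot2 computed for the box can be pulled from the built plot. The following is a minimal sketch, assuming ggplot2's `layer_data()` helper and the default boxplot stat; the object names `p`, `box`, `whiskers` and `flagged` are only illustrative.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = "", y = mpg)) +
  geom_boxplot()

# layer_data() returns the data computed for the given layer (here the boxplot)
box <- layer_data(p, 1)

# whisker ends: the most extreme observations that are NOT flagged as outliers
whiskers <- c(lower = box$ymin, upper = box$ymax)

# 'outliers' is a list column holding one numeric vector per box
flagged <- unlist(box$outliers)

print(whiskers)
print(flagged)
```

Note that `ymin` and `ymax` are the ends of the drawn whiskers, not the outlier limits themselves, so this only confirms which points the plot flags.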


Solution

  • I was able to get the output I wanted. Using the formula for boxplot outliers, I wrote two small functions that serve this purpose and also work within tidyverse pipelines:

    # helper that computes the boxplot whisker limits:
    
    outlierLimits = function(x, extreme = FALSE){
      # quartiles; NAs are dropped so they do not make quantile() fail
      qts = quantile(x, c(.25, .75), names = FALSE, na.rm = TRUE)
      
      # named 'iqr' to avoid masking stats::IQR()
      iqr = qts[2] - qts[1]
      
      ret = c(
        
        lower = qts[1] - iqr*1.5,
        upper = qts[2] + iqr*1.5,
        lower.extreme = qts[1] - iqr*3,
        upper.extreme = qts[2] + iqr*3
        
      )[c(TRUE, TRUE, extreme, extreme)]
      
      return(ret)
    }
    
    # The function I was looking for:
    
    outlierClassify = function(x, extreme = FALSE,
                               labels = c("regular", "outlier",
                                          "extreme")[c(TRUE, TRUE, extreme)]){
      lims = outlierLimits( x, extreme )
      
      # values exactly on a limit still count as regular
      ret = ifelse(x >= lims[1] & x <= lims[2],
                   labels[1], labels[2])
      
      if(extreme){
        # which() drops NAs, which are not allowed in subscripted assignment
        out = which(ret != labels[1])
        
        ret[out] = ifelse(
          
          x[out] >= lims[3] & 
            x[out] <= lims[4],
          
          labels[2], labels[3]
        )
      }
      return(ret)
    }
    

    This way, the outlierClassify function returns a character vector with one label for each element of the input vector x.

    Some usage examples:

    # simply obtaining the resulting vector
    
    outlierClassify(mtcars$mpg, FALSE)
    
    # using it with mutate()
    library(dplyr)
    
    test = mtcars %>%
      select(mpg, cyl) %>%
      mutate(car = rownames(mtcars),
             .before = 1) %>% 
    
      # added an 'extreme' outlier for illustration
      rbind(data.frame(
        car = "UNO Mille", mpg = 34, cyl = 6
      )) %>% 
      group_by(cyl) %>% 
      mutate(outliers = outlierClassify(mpg, TRUE),
             .after = mpg)
    
    # using it with ggplot
    library(ggplot2)
    
    test %>% 
      ggplot(aes(x = as.factor(cyl), y = mpg))+
      geom_boxplot(outlier.shape = NA, fill = "#2645df", alpha = .6)+
      geom_jitter(aes(color = outliers), width = .1)+
      # making it pretty; colors are named so each label keeps its color
      scale_color_manual(values = c(extreme = "red", outlier = "darkorange",
                                    regular = "black"))+
      theme_minimal()+
      theme(plot.background = element_rect(fill = "wheat3"))
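
As a quick sanity check of the limits formula used above, the calculation can be written out by hand for mtcars$mpg. This is only a sketch under the assumption that quartiles come from `quantile()` with its default method; the names `q`, `iqr`, `fences` and `outliers` are illustrative.

```r
x <- mtcars$mpg

# first and third quartiles with quantile()'s default method (type 7)
q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# limits: 1.5 * IQR below the first quartile and above the third
fences <- c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)

# values strictly outside the limits are the ones drawn as outlier points
outliers <- x[x < fences["lower"] | x > fences["upper"]]

print(fences)
print(outliers)
```

For mtcars$mpg this gives limits of 4.3625 and 33.8625, so 33.9 is the only value flagged, which matches the plot in the question. Base R's boxplot.stats() builds the box from hinges (see fivenum()) rather than from quantile(), so its limits can differ slightly from these.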