
Excluding outliers based on multiple columns in R - IQR method


I'm trying to exclude outliers based on a subset of selected variables, with the aim of performing sensitivity analyses. I've adapted the function available here (calculating the outliers in R), but have been unsuccessful so far (I'm still a novice R user). Please let me know if you have any suggestions!

df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011),
                 measure1 = rnorm(11, mean = 8, sd = 4),
                 measure2 = rnorm(11, mean = 40, sd = 5),
                 measure3 = rnorm(11, mean = 20, sd = 2),
                 measure4 = rnorm(11, mean = 9, sd = 3))

vars_of_interest <- c("measure1", "measure3", "measure4")

# define a function to remove outliers
FindOutliers <- function(data) {
  lowerq = quantile(data)[2]
  upperq = quantile(data)[4]
  iqr = upperq - lowerq #Or use IQR(data)
  # we identify extreme outliers
  extreme.threshold.upper = (iqr * 3) + upperq
  extreme.threshold.lower = lowerq - (iqr * 3)
  result <- which(data > extreme.threshold.upper | data < extreme.threshold.lower)
}

# use the function to identify outliers
temp <- FindOutliers(df[vars_of_interest])

# remove the outliers
testData <- testData[-temp]

# show the data with the outliers removed
testData

Solution

  • Separate the concerns:

    1. Identify outliers in a numeric vector using the IQR method. This can be encapsulated in a function taking a vector.
    2. Remove outliers from several columns of a data.frame. This is a function taking a data.frame.
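
    As a quick worked example of step 1, the sketch below computes the quartiles and the 3 * IQR fences by hand for a small vector. The values are made up for illustration and are not taken from the question.

    ```r
    # Illustrative values only; 100 is the obvious extreme
    x <- c(10, 11, 9, 10, 12, 11, 10, 100)

    # First and third quartiles (default quantile type 7)
    qs <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
    iqr <- qs[2] - qs[1]

    # Fences for "extreme" outliers: 3 * IQR beyond the quartiles
    lower_fence <- qs[1] - 3 * iqr
    upper_fence <- qs[2] + 3 * iqr

    # Values outside the fences
    x[x < lower_fence | x > upper_fence]
    #> [1] 100
    ```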

    I would suggest returning a logical vector rather than indices. This way, the returned value has the same length as the input, which makes it easy to create a new column, for example: df$outlier <- is_outlier(df$measure1).
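
    One practical reason to prefer a logical vector: negative indexing with the result of which() drops everything when no outlier is found, because which() then returns integer(0). A minimal sketch with made-up flags:

    ```r
    x <- c(5, 7, 6, 8, 7)
    flag <- rep(FALSE, length(x))   # pretend no value was flagged

    idx <- which(flag)              # integer(0)
    kept_by_index <- x[-idx]        # numeric(0): every value is lost
    kept_by_flag  <- x[!flag]       # all five values are kept

    length(kept_by_index)
    #> [1] 0
    length(kept_by_flag)
    #> [1] 5
    ```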

    Note how the argument names make it clear which type of input is expected: x is a standard name for a numeric vector and df is obviously a data.frame. cols is probably a list or vector of column names.

    I made a point of using only base R, but in real life I would use the dplyr package to manipulate the data.frame.

    #' Detect outliers using IQR method
    #' 
    #' @param x A numeric vector
    #' @param na.rm Whether to exclude NAs when computing quantiles
    #' 
    is_outlier <- function(x, na.rm = FALSE) {
      qs <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)

      lowerq <- qs[1]
      upperq <- qs[2]
      iqr <- upperq - lowerq

      # Fences for extreme outliers: 3 * IQR beyond the quartiles
      extreme.threshold.upper <- (iqr * 3) + upperq
      extreme.threshold.lower <- lowerq - (iqr * 3)

      # Return logical vector
      x > extreme.threshold.upper | x < extreme.threshold.lower
    }
    
    #' Remove rows with outliers in given columns
    #' 
    #' Any row with at least 1 outlier will be removed
    #' 
    #' @param df A data.frame
    #' @param cols Names of the columns of interest. Defaults to all columns.
    #' 
    #' 
    remove_outliers <- function(df, cols = names(df)) {
      for (col in cols) {
        cat("Removing outliers in column: ", col, " \n")
        # drop = FALSE keeps a data.frame even when df has a single column
        df <- df[!is_outlier(df[[col]]), , drop = FALSE]
      }
      df
    }
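
    Note that the loop above filters column by column, so the quartiles of later columns are computed only on rows that survived the earlier columns. If you would rather flag every column against the original data and drop rows in one pass, a possible variant is sketched below. remove_outliers_once is a hypothetical name, the compact is_outlier is repeated only to keep the snippet self-contained, and the toy data is made up.

    ```r
    # Compact copy of the IQR rule (3 * IQR fences), for self-containment
    is_outlier <- function(x, na.rm = FALSE) {
      qs <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm, names = FALSE)
      iqr <- qs[2] - qs[1]
      x > qs[2] + 3 * iqr | x < qs[1] - 3 * iqr
    }

    # Hypothetical one-pass variant: flag all columns first, then subset once
    remove_outliers_once <- function(df, cols = names(df)) {
      flags <- lapply(df[cols], is_outlier)   # one logical vector per column
      any_outlier <- Reduce(`|`, flags)       # TRUE if any column flags the row
      df[!any_outlier, , drop = FALSE]
    }

    # Made-up data: row 8 is extreme in a, row 7 is extreme in b
    toy <- data.frame(
      ID = 1:8,
      a  = c(10, 11, 9, 10, 12, 11, 10, 100),
      b  = c(5, 6, 5, 4, 6, 5, -50, 5)
    )

    remove_outliers_once(toy, c("a", "b"))$ID
    #> [1] 1 2 3 4 5 6
    ```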
    

    Armed with these two functions, removing the outliers takes a single call:

    df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011),
                     measure1 = rnorm(11, mean = 8, sd = 4),
                     measure2 = rnorm(11, mean = 40, sd = 5),
                     measure3 = rnorm(11, mean = 20, sd = 2),
                     measure4 = rnorm(11, mean = 9, sd = 3))
    
    vars_of_interest <- c("measure1", "measure3", "measure4")
    
    
    df_filtered <- remove_outliers(df, vars_of_interest)
    #> Removing outliers in column:  measure1  
    #> Removing outliers in column:  measure3  
    #> Removing outliers in column:  measure4
    
    df_filtered
    #>      ID  measure1 measure2 measure3   measure4
    #> 1  1001  9.127817 40.10590 17.69416  8.6031175
    #> 2  1002 18.196182 38.50589 23.65251  7.8630485
    #> 3  1003 10.537458 37.97222 21.83248  6.0798316
    #> 4  1004  5.590463 46.83458 21.75404  6.9589981
    #> 5  1005 14.079801 38.47557 20.93920 -0.6370596
    #> 6  1006  3.830089 37.19281 19.56507  6.2165156
    #> 7  1007 14.644766 37.09235 19.78774 10.5133674
    #> 8  1008  5.462400 41.02952 20.14375 13.5247993
    #> 9  1009  5.215756 37.65319 22.23384  7.3131715
    #> 10 1010 14.518045 48.97977 20.33128  9.9482211
    #> 11 1011  1.594353 44.09224 21.32434 11.1561089
    

    Created on 2020-03-23 by the reprex package (v0.3.0)
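
    For reference, the dplyr approach mentioned above could look roughly like this. It is a sketch, not part of the original reprex: it assumes dplyr 1.0.4 or later (for if_all()), repeats a compact is_outlier so the snippet is self-contained, and uses made-up data.

    ```r
    library(dplyr)

    # Compact copy of the IQR rule (3 * IQR fences), for self-containment
    is_outlier <- function(x, na.rm = FALSE) {
      qs <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm, names = FALSE)
      iqr <- qs[2] - qs[1]
      x > qs[2] + 3 * iqr | x < qs[1] - 3 * iqr
    }

    # Made-up data: row 8 is extreme in a, row 7 is extreme in b
    toy <- data.frame(
      ID = 1:8,
      a  = c(10, 11, 9, 10, 12, 11, 10, 100),
      b  = c(5, 6, 5, 4, 6, 5, -50, 5)
    )
    cols <- c("a", "b")

    # Keep only rows where none of the selected columns is flagged
    toy_filtered <- toy %>%
      filter(if_all(all_of(cols), ~ !is_outlier(.x)))

    toy_filtered$ID
    #> [1] 1 2 3 4 5 6
    ```

    Here every column is flagged against the full data before any row is dropped, which differs slightly from the column-by-column loop in remove_outliers.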