rexcelmatrixoutliers

Identifying outliers with differnt numbers of data points per sample in Execl or R


I am trying to identify Outiers in a Dataset, where each sample has a different amount of Data points. Meaning the Matrix length differs from sample to sample. Since I am working on a big data set it is really Time consuming to change matrix length manually. I hope to find a solution for that problem here:) Also suggestions for R are welcome!

Here is an example data set.

df1 <-
  structure(list(
    Time = c(5, 17, 18, 24, 4, 8, 9, 10, 16, 15, 4, 13, 9, 17), 
    ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3)), 
    row.names = c(NA, -14L), 
    class = c("tbl_df", "tbl", "data.frame" ))

I used the Function:

=IF(OR(B2>QUARTILE($B$2:$B$5;3)+1,5*(QUARTILE($B$2:$B$5;3)-QUARTILE($B$2:$B$5;1));F2<QUARTILE($B$2:$B$51;1)-1,5*(QUARTILE($B$2:$B$5;3)-QUARTILE($B$2:$B$5;1)));1;0)  

But as described above, with that approach I need to change the matrix manually from sample to sample.
The outlier is then identified with "1".


Solution

  • Here are two ways, with quantile/IQR and with the statistics boxplot uses. The latter is fastest.

    outlier <- function(x, na.rm = FALSE) {
      qq <- quantile(x, c(1/4, 3/4), na.rm = na.rm)
      iqr <- IQR(x, na.rm = na.rm)
      x < qq[1L] - 1.5*iqr | x > qq[2L] + 1.5*iqr
    }
    outlier_bp <- function(x, na.rm = FALSE) {
      x %in% boxplot.stats(x)$out
    }
    
    sapply(iris[1:4], outlier) |> head()
    #>      Sepal.Length Sepal.Width Petal.Length Petal.Width
    #> [1,]        FALSE       FALSE        FALSE       FALSE
    #> [2,]        FALSE       FALSE        FALSE       FALSE
    #> [3,]        FALSE       FALSE        FALSE       FALSE
    #> [4,]        FALSE       FALSE        FALSE       FALSE
    #> [5,]        FALSE       FALSE        FALSE       FALSE
    #> [6,]        FALSE       FALSE        FALSE       FALSE
    sapply(iris[1:4], outlier_bp) |> head()
    #>      Sepal.Length Sepal.Width Petal.Length Petal.Width
    #> [1,]        FALSE       FALSE        FALSE       FALSE
    #> [2,]        FALSE       FALSE        FALSE       FALSE
    #> [3,]        FALSE       FALSE        FALSE       FALSE
    #> [4,]        FALSE       FALSE        FALSE       FALSE
    #> [5,]        FALSE       FALSE        FALSE       FALSE
    #> [6,]        FALSE       FALSE        FALSE       FALSE
    
    library(microbenchmark)
    
    mb <- microbenchmark(
      qnt = sapply(iris[1:4], outlier),
      bxp = sapply(iris[1:4], outlier_bp)
    )
    print(mb, order = "median")
    #> Unit: microseconds
    #>  expr   min     lq    mean median     uq    max neval cld
    #>   bxp 233.3 321.15 388.300 365.55 454.15  738.9   100   b
    #>   qnt 551.2 742.65 875.958 817.95 949.30 3593.9   100  a
    

    Created on 2023-09-25 with reprex v2.0.2


    Edit

    After the OP comments, here is a better use case for the function outlier above.
    Split Time by ID and apply the function to each sub-vector. This is done with tapply.

    i_out <- with(df1, tapply(Time, ID, FUN = outlier))
    i_out
    #> $`1`
    #> [1]  TRUE FALSE FALSE FALSE
    #> 
    #> $`2`
    #> [1] FALSE FALSE FALSE
    #> 
    #> $`3`
    #> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    
    out <- with(df1, tapply(Time, ID, FUN = \(x) x[outlier(x)]))
    out
    #> $`1`
    #> [1] 5
    #> 
    #> $`2`
    #> numeric(0)
    #> 
    #> $`3`
    #> numeric(0)
    

    Created on 2023-09-30 with reprex v2.0.2


    Or with a function that returns the outliers, not an index to them,

    outlier2 <- function(x, na.rm = FALSE) {
      qq <- quantile(x, c(1/4, 3/4), na.rm = na.rm)
      iqr <- IQR(x, na.rm = na.rm)
      i <- x < qq[1L] - 1.5*iqr | x > qq[2L] + 1.5*iqr
      x[i]
    }
    
    with(df1, tapply(Time, ID, FUN = outlier2))
    #> $`1`
    #> [1] 5
    #> 
    #> $`2`
    #> numeric(0)
    #> 
    #> $`3`
    #> numeric(0)
    

    Created on 2023-09-30 with reprex v2.0.2