I am trying to identify Outiers in a Dataset, where each sample has a different amount of Data points. Meaning the Matrix length differs from sample to sample. Since I am working on a big data set it is really Time consuming to change matrix length manually. I hope to find a solution for that problem here:) Also suggestions for R are welcome!
df1 <-
structure(list(
Time = c(5, 17, 18, 24, 4, 8, 9, 10, 16, 15, 4, 13, 9, 17),
ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3)),
row.names = c(NA, -14L),
class = c("tbl_df", "tbl", "data.frame" ))
I used the Function:
=IF(OR(B2>QUARTILE($B$2:$B$5;3)+1,5*(QUARTILE($B$2:$B$5;3)-QUARTILE($B$2:$B$5;1));F2<QUARTILE($B$2:$B$51;1)-1,5*(QUARTILE($B$2:$B$5;3)-QUARTILE($B$2:$B$5;1)));1;0)
But as described above, with that approach I need to change the matrix manually from sample to sample.
The outlier is then identified with "1".
Here are two ways, with quantile/IQR
and with the statistics boxplot
uses. The latter is fastest.
outlier <- function(x, na.rm = FALSE) {
qq <- quantile(x, c(1/4, 3/4), na.rm = na.rm)
iqr <- IQR(x, na.rm = na.rm)
x < qq[1L] - 1.5*iqr | x > qq[2L] + 1.5*iqr
}
outlier_bp <- function(x, na.rm = FALSE) {
x %in% boxplot.stats(x)$out
}
sapply(iris[1:4], outlier) |> head()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,] FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE FALSE
#> [4,] FALSE FALSE FALSE FALSE
#> [5,] FALSE FALSE FALSE FALSE
#> [6,] FALSE FALSE FALSE FALSE
sapply(iris[1:4], outlier_bp) |> head()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,] FALSE FALSE FALSE FALSE
#> [2,] FALSE FALSE FALSE FALSE
#> [3,] FALSE FALSE FALSE FALSE
#> [4,] FALSE FALSE FALSE FALSE
#> [5,] FALSE FALSE FALSE FALSE
#> [6,] FALSE FALSE FALSE FALSE
library(microbenchmark)
mb <- microbenchmark(
qnt = sapply(iris[1:4], outlier),
bxp = sapply(iris[1:4], outlier_bp)
)
print(mb, order = "median")
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> bxp 233.3 321.15 388.300 365.55 454.15 738.9 100 b
#> qnt 551.2 742.65 875.958 817.95 949.30 3593.9 100 a
Created on 2023-09-25 with reprex v2.0.2
After the OP comments, here is a better use case for the function outlier
above.
Split Time
by ID
and apply the function to each sub-vector. This is done with tapply
.
i_out <- with(df1, tapply(Time, ID, FUN = outlier))
i_out
#> $`1`
#> [1] TRUE FALSE FALSE FALSE
#>
#> $`2`
#> [1] FALSE FALSE FALSE
#>
#> $`3`
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
out <- with(df1, tapply(Time, ID, FUN = \(x) x[outlier(x)]))
out
#> $`1`
#> [1] 5
#>
#> $`2`
#> numeric(0)
#>
#> $`3`
#> numeric(0)
Created on 2023-09-30 with reprex v2.0.2
Or with a function that returns the outliers, not an index to them,
outlier2 <- function(x, na.rm = FALSE) {
qq <- quantile(x, c(1/4, 3/4), na.rm = na.rm)
iqr <- IQR(x, na.rm = na.rm)
i <- x < qq[1L] - 1.5*iqr | x > qq[2L] + 1.5*iqr
x[i]
}
with(df1, tapply(Time, ID, FUN = outlier2))
#> $`1`
#> [1] 5
#>
#> $`2`
#> numeric(0)
#>
#> $`3`
#> numeric(0)
Created on 2023-09-30 with reprex v2.0.2