Note: Question edited to simplify the data.
I have a few datasets combined in a dataframe, which I want to eliminate outliers from. When trying different ways to calculate upper and lower thresholds I came upon discrepacies between the results of ggplot-boxplots and manual calculation. I would like to (1) understand the discrepancies and (2) find a convenient way to eliminate outliers from multiple similar datasets via dplyr.
Example data (taken from the boxplot.stats help page):
library(tidyverse)
require(stats)
x <- c(1:100, 1000)
(b1 <- boxplot.stats(x))
This returns ymin = 1 and max = 100 and 1000 as an obvious outlier.
My initial approach can be reproduced with this data, too... to have it complete, here is my original dataset too (however discussion is based on the simpler one).
# small dataset
values_A30 <- c(0.2079762,0.2605029,0.3054334,0.304067,0.8696487,0.3470931,0.2560001,0.3096838,0.2887556,0.3741472,0.2178375,0.2234682,0.2628923,0.2745458,0.5438208,0.7068278,0.6492924,1.100175,0.2740491,0.2849299,0.7562737,0.3009749,0.2598575,0.3460925,0.3265929,0.4208336,0.4353992,1.132036,0.3856708,0.1978752,0.3676808,0.4196799,0.4486595,0.3282394,0.3725664,0.385373,0.3680049,0.7875058,0.8098903,0.741165,1.260887,0.3521471,0.3883195,1.17124,0.3225514,0.3492051)
bpdata <- data.frame(Var1 = "A", SSW = 30, Value = values_A30)
Usually I start up with ggplot boxplots to get a visual impression.
# test plot
ggplot(as.data.frame(x),
aes(x = NA, y = x)) +
geom_boxplot()
#extract stats
gg_stats <- ggplot_build(
ggplot(as.data.frame(x),
aes(x = NA, y = x)) +
geom_boxplot()
)$data[[1]] |>
select(1:9)
# manual stats
man_stats <- as.data.frame(x) |>
summarise(
Qu1 = quantile(x, 0.25),
Qu3 = quantile(x, 0.75),
IQR = IQR(x)
) |>
mutate(ymin = Qu1 - (1.5 * IQR),
ymax = Qu3 + (1.5 * IQR))
Both, boxplot.stats and ggplot_build, lead to the same results. The manual calculation, however, leads to different results, ymin = -49 and ymax = 151. The values for the quartiles and IQR are correct and the rest of the calculation seems not to be too complicated, so: What am I doing wrong?
I worked more on the problem described and found the answer with the hints below and also on [this page] (Box-plot R calculating outliers).
To sum it up:
Thanks again for help @Michiel Duvekot
It's perhaps easier to see what's happening if you use a very simple range of values
Value <- 0:100
Qu1 <- Value |> quantile(0.25)
Qu3 <- Value |> quantile(0.75)
IQR <- Value |> IQR()
min <- Qu1 - (1.5 * IQR)
max <- Qu3 + (1.5 * IQR)
cat(min, max)
gives
-50 150
if you want to know which values are considered outliers, you can use
(bpdata %>% filter(Var1 == "A", SSW == 30) %>% pull(Value) %>% boxplot.stats())$out
If you want to calculate the min and max yourself, you can do what boxplt.stats does:
y <- bpdata %>% filter(Var1 == "A", SSW == 30) %>% pull(Value)
stats <- stats::fivenum(y, na.rm = TRUE)
iqr <- diff(stats[c(2, 4)])
coef <- 1.5
min <- (stats[2] - coef * iqr)
max <- (stats[4] + coef * iqr)