Below is an example using the mtcars dataset. There is one outlier with a value of 33.9, but I want a function that finds all of them for a given column.
library(dplyr)
library(ggplot2)
mtcars %>%
  ggplot(aes(x = "", y = mpg)) +
  geom_boxplot(fill = "#2645df")
I do not know the formula for boxplot whisker limits, so I read approximate cutoffs off the plot above and hard-coded them:
res = ifelse(mtcars$mpg > 33 | mtcars$mpg < 10, "outlier", "not outlier")
This approach is both inefficient and incorrect: 33 is not the upper limit for outliers, and 10 is not the lower limit.
I was able to get the output I wanted. Using the formula for boxplot outliers (values more than 1.5 times the IQR below the first quartile or above the third), I wrote two small functions that serve this purpose and also work inside tidyverse pipelines:
# smaller function to find the boxplot whisker limits:
outlierLimits = function(x, extreme = FALSE){
  # first and third quartiles (quantile() default method)
  qts = quantile(x, c(.25, .75), names = FALSE)
  # named iqr rather than IQR to avoid shadowing stats::IQR()
  iqr = qts[2] - qts[1]
  ret = c(
    lower = qts[1] - iqr*1.5,
    upper = qts[2] + iqr*1.5,
    lower.extreme = qts[1] - iqr*3,
    upper.extreme = qts[2] + iqr*3
  )[c(TRUE, TRUE, extreme, extreme)]
  return(ret)
}
# The function I was looking for:
outlierClassify = function(x, extreme = FALSE,
                           labels = c("regular", "outlier",
                                      "extreme")[c(TRUE, TRUE, extreme)]){
  lims = outlierLimits(x, extreme)
  # values exactly on a limit are not outliers
  ret = ifelse(x >= lims[1] & x <= lims[2],
               labels[1], labels[2])
  if(extreme){
    # among the outliers, relabel those beyond the 3*IQR limits
    out = ret != labels[1]
    ret[out] = ifelse(
      x[out] >= lims[3] & x[out] <= lims[4],
      labels[2], labels[3]
    )
  }
  return(ret)
}
This way, the outlierClassify function returns a character vector with one label for each element of the input vector x.
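As a quick sanity check of the 1.5 * IQR rule these functions rely on, the limits for mtcars$mpg can be computed by hand. This is only a minimal sketch using base R; the variable names (q, iqr, lower, upper, out) are illustrative and not part of the functions above.

```r
# Hand computation of the 1.5 * IQR limits for mtcars$mpg (base R only)
x <- mtcars$mpg

# Quartiles with quantile()'s default method
q <- quantile(x, c(.25, .75), names = FALSE)
iqr <- q[2] - q[1]

lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# The two limits: 4.3625 and 33.8625
c(lower = lower, upper = upper)

# Values strictly outside the limits: only 33.9
out <- x[x < lower | x > upper]
out
```

This matches the single point drawn beyond the whisker in the ggplot2 boxplot, and shows why hard-coded cutoffs such as 33 and 10 only approximate the real limits.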
Some usage examples:
# simply obtaining the resulting vector
outlierClassify(mtcars$mpg, FALSE)
# using it with mutate()
library(dplyr)
test = mtcars %>%
  select(mpg, cyl) %>%
  mutate(car = rownames(mtcars),
         .before = 1) %>%
  # add an 'extreme' outlier as an example
  rbind(data.frame(
    car = "UNO Mille", mpg = 34, cyl = 6
  )) %>%
  group_by(cyl) %>%
  mutate(outliers = outlierClassify(mpg, TRUE),
         .after = mpg)
# using it with ggplot
library(ggplot2)
test %>%
  ggplot(aes(x = as.factor(cyl), y = mpg)) +
  geom_boxplot(outlier.shape = NA, fill = "#2645df", alpha = .6) +
  geom_jitter(aes(color = outliers), width = .1) +
  # making it pretty; naming the colors keeps each label tied to its color
  scale_color_manual(values = c(extreme = "red", outlier = "darkorange",
                                regular = "black")) +
  theme_minimal() +
  theme(plot.background = element_rect(fill = "wheat3"))
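If only the outlying rows are needed rather than a label for every row, a logical variant may be simpler to use with filter(). The following is a minimal sketch under that assumption: is_outlier() and its coef argument are hypothetical names introduced here for illustration, not part of the functions above, and it applies the same 1.5 * IQR rule within each group.

```r
library(dplyr)

# Hypothetical helper (not from the answer above): TRUE for values
# strictly outside the coef * IQR limits, FALSE otherwise.
is_outlier <- function(x, coef = 1.5) {
  q <- quantile(x, c(.25, .75), names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - coef * iqr | x > q[2] + coef * iqr
}

# Keep only the rows flagged within each cylinder group
flagged <- mtcars %>%
  mutate(car = rownames(mtcars), .before = 1) %>%
  group_by(cyl) %>%
  filter(is_outlier(mpg)) %>%
  ungroup()

flagged
```

Because the limits are recomputed per group, a value such as 33.9 that is an outlier for the whole column is not necessarily flagged within its own cyl group.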