r dplyr percentile categorical

How to remove top and bottom percentile values when both categorical and numerical columns exist in R


Consider the data frame below:

df <- data.frame(a=c("Y", "Y","N", "Y", "N", "N"),
                 b = c(200, 1,1.4,1.3,2,1.6),
                 c = c(200,-200,10,12,14,15),
                 d = c("f","f","m", "m","m","m"))

  a     b    c d
1 Y 200.0  200 f
2 Y   1.0 -200 f
3 N   1.4   10 m
4 Y   1.3   12 m
5 N   2.0   14 m
6 N   1.6   15 m

I want to trim the data frame so that any row with a value below the 1st percentile or above the 99th percentile in any numeric column is removed. The expected result is:

  a   b  c d
1 N 1.4 10 m
2 Y 1.3 12 m
3 N 2.0 14 m
4 N 1.6 15 m

I can remove the unwanted top and bottom values when no categorical variables are present:

df %>% dplyr::select(where(is.numeric)) %>%
    filter_all(all_vars(between(., quantile(., .01), quantile(., .99))))

But I do not know how to do this while keeping the categorical columns. Any help or hint is appreciated.


Solution

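  • We could use if_all (available from dplyr 1.0.4) inside filter and pick the numeric columns with where(is.numeric). The condition is evaluated only on those columns, so the categorical columns are kept, and a row survives only if every numeric value lies within its column's 1st to 99th percentile range (between is inclusive at both ends)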

    library(dplyr)
    df %>%
       filter(if_all(where(is.numeric),
         ~ between(.x, quantile(.x, .01), quantile(.x, .99))))
    

    -output

      a   b  c d
    1 N 1.4 10 m
    2 Y 1.3 12 m
    3 N 2.0 14 m
    4 N 1.6 15 m
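
If the same trimming is needed in several places, the filter above could be wrapped in a small function. This is only a sketch: trim_percentiles and its lower/upper arguments are hypothetical names, not part of dplyr, and na.rm = TRUE is an added assumption so that quantile() does not stop with an error on columns that contain NA.

```r
library(dplyr)

# Hypothetical helper (not a dplyr function): keep the rows whose numeric
# values all lie within the [lower, upper] quantiles of their own column.
# Non-numeric columns are carried along unchanged.
trim_percentiles <- function(data, lower = 0.01, upper = 0.99) {
  data %>%
    filter(if_all(
      where(is.numeric),
      ~ between(.x,
                quantile(.x, lower, na.rm = TRUE),  # lower cutoff per column
                quantile(.x, upper, na.rm = TRUE))  # upper cutoff per column
    ))
}

# Same data as in the question
df <- data.frame(a = c("Y", "Y", "N", "Y", "N", "N"),
                 b = c(200, 1, 1.4, 1.3, 2, 1.6),
                 c = c(200, -200, 10, 12, 14, 15),
                 d = c("f", "f", "m", "m", "m", "m"))

trimmed <- trim_percentiles(df)
trimmed
```

Note that filter() drops rows where the condition evaluates to NA, so with this sketch a row that has an NA in any numeric column is removed as well. In this small example the 1st and 99th percentiles are interpolated very close to each column's minimum and maximum, which is why exactly the two rows holding those extremes are dropped.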