rfunctionrangeoutliersgroup

Creating an R function (or a loop) to count the number of samples that lie above and below a range for each column in a data frame


I have a data set with different groups and many columns with different variables. For each column I want to calculate the range of values in one group, then count how many rows have a value that is above or below that range in the other groups. I have written an R scipt that does this for a single column, but as the dataframe has multiple columns I want to either create a function or loop that would allow me to apply this code to multiple column in the datafame without having to copy and paste the code.

a simplified example data frame.

sample <- c(1, 2, 3, 4, 7, 8, 9, 10, 11, 13, 14, 15)
change <- c('normal1', 'normal2', 'normal3', 'normal4', 'high', 'high', 'high',                  'low', 'low', 'medium', 'medium', 'medium')
x <- c(10, 15, 12, 13, 7, 25, 15, 5, 11, 21, 28, NA)
y <- c(0.6, 0.5, 0.3, 0.8, 0.2, 0.9, 0.5, 0.5, 0.3, 0.8, 1.0, 0.1)
z <- c(100, 200, 150, 110, 90, 500, 50, 40, 150, 100, 75, NA)
data <- data.frame(sample, change, x, y, z)

find the min and max values in the "normal" group

normal <- c('normal1', 'normal2', 'normal3', 'normal4')

minmax_normal <- data %>% 
  filter(change %in% normal) %>%
  group_by(change %in% normal) %>% 
  summarise(across(where(is.numeric), list(min=min, max=max), na.rm = TRUE))

count the number of samples in other groups (low, medium, high) where x is higher or lower than the range found in the "normal" groups. Store in dataframe outliers_count

outliers_count <- data %>% 
  filter(!is.na(x)) %>%
  group_by(change) %>%
  mutate(lower_bound= minmax_normal$x_min,
         upper_bound= minmax_normal$x_max)%>%
  mutate(is_pos_outlier=if_else(x>upper_bound,TRUE,FALSE))%>%
  mutate(is_neg_outlier=if_else(x<lower_bound,TRUE,FALSE))%>%   
summarise(x_pos_outliers=sum(is_pos_outlier),x_neg_outliers=sum(is_neg_outlier))

I need to perform the same calculation on all columns (x, y, z) either by converting this to a function to make it easier to apply to other columns, or a way to perform on all columns at the same time or write as a loop instead.

I've tried creating the below function

count_outliers <- function(data, column, min, max) {
  data %>% 
    filter(!is.na(column)) %>%
    group_by(change) %>%
    mutate(lower_bound= min,
     upper_bound= max)%>%
    mutate(is_pos_outlier=if_else(column>upper_bound,TRUE,FALSE))%>%
    mutate(is_neg_outlier=if_else(column<lower_bound,TRUE,FALSE))%>%
    summarise(x_pos_outliers=sum(is_pos_outlier), x_neg_outliers=sum(is_neg_outlier))
}

call function to calculate outliers above and below range in x

 x_outliers <- count_outliers(data, x, minmax_normal$x_min,minmax_normal$x_max)

which returns the error

Error in `mutate()`:
ℹ In argument: `is_pos_outlier = if_else(data$x > upper_bound,     TRUE, FALSE)`.
ℹ In group 1: `change = "high"`.
Caused by error:
! `is_pos_outlier` must be size 3 or 1, not 12.

Part of the issue seems to be that I'm trying to pull values from two different dataframes in one function I think, I also tried joining data with minmax_normal so everything was in one data.frame but I coulndt get that to work either. Trying to get better at not copy pasting code and I thought this would be straight forward but I've tried a few different approaches and no luck yet. Any help appreciated.


Solution

  • I have found the answer. You have to use embracing: {{ column }}. This is because the x in your final line of code is actually meant to be data$x. When you use {{ }}, R knows that you mean the column of another argument.

    I recommend you to to add remove(x, y, z) to your code. This way, R can't confuse data$x (which is supposed to get filtered) with the vector x (which doesn't get filtered).

    remove(x, y, z)
    count_outliers <- function(data, column, min, max) {
      data %>% 
        filter(!is.na({{ column }})) %>%
        group_by(change) %>%
        mutate(lower_bound = min,
               upper_bound = max) %>%
        mutate(is_pos_outlier = if_else({{ column }} > upper_bound, TRUE, FALSE),
               is_neg_outlier = if_else({{ column }} < lower_bound, TRUE, FALSE)) %>%
        summarise(x_pos_outliers = sum(is_pos_outlier), x_neg_outliers = sum(is_neg_outlier))
    }
    x_outliers <- count_outliers(data, x, minmax_normal$x_min, minmax_normal$x_max)
    

    Just so you know, some functions require you to embrace like this: pick({{ column }}). Examples of these functions are select(), relocate(), rename(), rename_with() and all of the slice() variants. These functions use tidy selection rather than data masking. More information can be found here: https://r4ds.hadley.nz/functions#data-frame-functions