Tags: r, statistics, tidyverse, data-handling

Problem with parsing variables in an R function


I'm working with a dataset in R and I want to create a function that separates values into three categories based on quartiles. Values below the first quartile (Q1) should be labeled "Low expression", values above the third quartile (Q3) should be labeled "High expression", and everything else should be labeled "NA".

I wrote the following function, but it keeps considering everything as high expression. Can someone help me with this one?

function


library(tidyverse)


fun_1 <- function(df, var_1) {
  
  quants <- quantile(df[[var_1]], c(.25, .75))
  
  tmp_1 <- df %>%
    # selecting var_1
    select({{var_1}}) %>%
    # creating a new variable classifying Q1 and Q4
    mutate(
      expression =
        case_when(
          {{var_1}} >= quants[["75%"]] ~ sprintf(
            "High expression ( >= %.02f )", quants[["75%"]]),
          {{var_1}} <= quants[["25%"]] ~ sprintf(
            "Low expression ( <= %.02f )", quants[["25%"]]),
          .default = NA)) %>%
    # excluding Q2 and Q3
    filter(!is.na(expression))
          
  tmp_1
}

fun_1(iris, "Sepal.Length")

After running the code, I get the following output:

[Image: output of the function with the iris dataset]

However, I expect the function to remove values between the first (Q1) and third (Q3) quartiles, keeping only the low expression values (below Q1) and the high expression values (above Q3).


Solution

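  • I think the {{var_1}} is not working as intended there: because var_1 is passed as a string, {{var_1}} inside mutate() injects the string itself rather than the column, so the comparison never looks at the data. One way to fix that is to convert var_1 to a symbol and unquote it with rlang's !! operator.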

    library(dplyr) # attach once at the top level rather than inside the function

    fun_2 <- function(df, var_1) {
      sym <- as.symbol(var_1)
      quants <- quantile(df[[var_1]], c(0.25, 0.75), names = FALSE)
      df %>% 
        select(!!sym) %>%
        mutate(expression = case_when(
          !!sym < quants[1] ~ sprintf("Low expression ( < %.02f )", quants[1]),
          !!sym >= quants[2] ~ sprintf("High expression ( >= %.02f )", quants[2]),
          .default = "NA")
        ) %>%
        filter(expression != "NA")
    }
    fun_2(iris, "Sepal.Length")
    #    Sepal.Length                  expression
    # 1           4.9   Low expression ( < 5.10 )
    # 2           4.7   Low expression ( < 5.10 )
    # 3           4.6   Low expression ( < 5.10 )
    # 4           5.0   Low expression ( < 5.10 )
    # 5           4.6   Low expression ( < 5.10 )
    # 6           5.0   Low expression ( < 5.10 )
    # 7           4.4   Low expression ( < 5.10 )
    # 8           4.9   Low expression ( < 5.10 )
    # 9           4.8   Low expression ( < 5.10 )
    # 10          4.8   Low expression ( < 5.10 )
    # 11          4.3   Low expression ( < 5.10 )
    # 12          4.6   Low expression ( < 5.10 )
    # 13          4.8   Low expression ( < 5.10 )
    # 14          5.0   Low expression ( < 5.10 )
    # 15          5.0   Low expression ( < 5.10 )
    # 16          4.7   Low expression ( < 5.10 )
    # 17          4.8   Low expression ( < 5.10 )
    # 18          4.9   Low expression ( < 5.10 )
    # 19          5.0   Low expression ( < 5.10 )
    # 20          4.9   Low expression ( < 5.10 )
    # 21          4.4   Low expression ( < 5.10 )
    # 22          5.0   Low expression ( < 5.10 )
    # 23          4.5   Low expression ( < 5.10 )
    # 24          4.4   Low expression ( < 5.10 )
    # 25          5.0   Low expression ( < 5.10 )
    # 26          4.8   Low expression ( < 5.10 )
    # 27          4.6   Low expression ( < 5.10 )
    # 28          5.0   Low expression ( < 5.10 )
    # 29          7.0 High expression ( >= 6.40 )
    # 30          6.4 High expression ( >= 6.40 )
    # 31          6.9 High expression ( >= 6.40 )
    # 32          6.5 High expression ( >= 6.40 )
    # 33          4.9   Low expression ( < 5.10 )
    # 34          6.6 High expression ( >= 6.40 )
    # 35          5.0   Low expression ( < 5.10 )
    # 36          6.7 High expression ( >= 6.40 )
    # 37          6.4 High expression ( >= 6.40 )
    # 38          6.6 High expression ( >= 6.40 )
    # 39          6.8 High expression ( >= 6.40 )
    # 40          6.7 High expression ( >= 6.40 )
    # 41          6.7 High expression ( >= 6.40 )
    # 42          5.0   Low expression ( < 5.10 )
    # 43          7.1 High expression ( >= 6.40 )
    # 44          6.5 High expression ( >= 6.40 )
    # 45          7.6 High expression ( >= 6.40 )
    # 46          4.9   Low expression ( < 5.10 )
    # 47          7.3 High expression ( >= 6.40 )
    # 48          6.7 High expression ( >= 6.40 )
    # 49          7.2 High expression ( >= 6.40 )
    # 50          6.5 High expression ( >= 6.40 )
    # 51          6.4 High expression ( >= 6.40 )
    # 52          6.8 High expression ( >= 6.40 )
    # 53          6.4 High expression ( >= 6.40 )
    # 54          6.5 High expression ( >= 6.40 )
    # 55          7.7 High expression ( >= 6.40 )
    # 56          7.7 High expression ( >= 6.40 )
    # 57          6.9 High expression ( >= 6.40 )
    # 58          7.7 High expression ( >= 6.40 )
    # 59          6.7 High expression ( >= 6.40 )
    # 60          7.2 High expression ( >= 6.40 )
    # 61          6.4 High expression ( >= 6.40 )
    # 62          7.2 High expression ( >= 6.40 )
    # 63          7.4 High expression ( >= 6.40 )
    # 64          7.9 High expression ( >= 6.40 )
    # 65          6.4 High expression ( >= 6.40 )
    # 66          7.7 High expression ( >= 6.40 )
    # 67          6.4 High expression ( >= 6.40 )
    # 68          6.9 High expression ( >= 6.40 )
    # 69          6.7 High expression ( >= 6.40 )
    # 70          6.9 High expression ( >= 6.40 )
    # 71          6.8 High expression ( >= 6.40 )
    # 72          6.7 High expression ( >= 6.40 )
    # 73          6.7 High expression ( >= 6.40 )
    # 74          6.5 High expression ( >= 6.40 )
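
    As for why the original fun_1 labels every row as high expression, my reading (an assumption based on how {{ }} behaves when it is handed a string, not something I verified against the asker's session) is that the condition in case_when() ends up comparing the column *name* with the quantile. A minimal base R sketch of that comparison:

    ```r
    # var_1 holds the column name as a string, exactly as passed to fun_1()
    var_1 <- "Sepal.Length"

    # third quartile of the actual column (6.4 for iris$Sepal.Length)
    q3 <- quantile(iris[[var_1]], 0.75, names = FALSE)

    # Comparing a string with a number coerces the number to a string,
    # so this is "Sepal.Length" >= "6.4": a text comparison, not a numeric one.
    # In typical locales letters sort after digits, so the result is TRUE.
    always_high <- var_1 >= q3
    print(always_high)
    ```

    Since that single TRUE is recycled across all rows, the first branch of case_when() wins everywhere.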
    

    If you read some of dplyr's "programming" docs, some try to make this even more convenient for users by removing the need for quotes around the column name. While I agree that this kind of non-standard evaluation (NSE) looks neat, it can sometimes make troubleshooting more difficult, especially for newer R users/programmers. But if you really want it, you could do something like this for the same results:

    library(dplyr) # attach once at the top level rather than inside the function

    fun_3 <- function(df, var_1) {
      sym <- rlang::enquo(var_1)
      quants <- quantile(pull(df, !!sym), c(0.25, 0.75), names = FALSE)
      df %>% select(!!sym) %>%
        mutate(expression = case_when(
          !!sym < quants[1] ~ sprintf("Low expression ( < %.02f )", quants[1]),
          !!sym >= quants[2] ~ sprintf("High expression ( >= %.02f )", quants[2]),
          .default = "NA")
        ) %>%
        filter(expression != "NA")
    }
    fun_3(iris, Sepal.Length) # notice no quotes needed
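
    For comparison, here is a sketch of two other spellings that should give the same rows: one keeps the quoted column name and uses dplyr's .data pronoun, the other keeps the asker's {{ }} but is called with a bare column name. The names fun_4 and fun_5 are made up for illustration, and I let case_when() fall through to a real NA instead of the string "NA"; treat this as one possible variant rather than the canonical fix.

    ```r
    library(dplyr)

    # fun_4 (hypothetical name): column name passed as a string
    fun_4 <- function(df, var_1) {
      quants <- quantile(df[[var_1]], c(0.25, 0.75), names = FALSE)
      df %>%
        select(all_of(var_1)) %>%
        mutate(expression = case_when(
          # .data[[var_1]] looks the column up by its name
          .data[[var_1]] < quants[1] ~ sprintf("Low expression ( < %.02f )", quants[1]),
          .data[[var_1]] >= quants[2] ~ sprintf("High expression ( >= %.02f )", quants[2])
          # rows matching neither condition get NA
        )) %>%
        filter(!is.na(expression))
    }

    # fun_5 (hypothetical name): bare column name, forwarded with {{ }}
    fun_5 <- function(df, var_1) {
      quants <- quantile(pull(df, {{ var_1 }}), c(0.25, 0.75), names = FALSE)
      df %>%
        select({{ var_1 }}) %>%
        mutate(expression = case_when(
          {{ var_1 }} < quants[1] ~ sprintf("Low expression ( < %.02f )", quants[1]),
          {{ var_1 }} >= quants[2] ~ sprintf("High expression ( >= %.02f )", quants[2])
        )) %>%
        filter(!is.na(expression))
    }

    res_4 <- fun_4(iris, "Sepal.Length") # quoted
    res_5 <- fun_5(iris, Sepal.Length)   # unquoted
    ```

    The .data version avoids the symbol conversion entirely, which may be easier to debug when the column name arrives as a string from elsewhere in a script.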