r, dplyr, metadata, metaprogramming, nse

Utilizing value labels in custom wrapper function around dplyr::case_match() to go inside of dplyr::mutate() and dplyr::across()


TL;DR: I would like to create a function that automates the process of creating a new column with values "Agree" or "Disagree" based on the underlying metadata (value labels) for the column undergoing the transformation. I would like this to be able to operate inside of both dplyr::across() and dplyr::mutate(). Moreover, it needs two arguments, one for the variable/column that will be recoded and one for the data frame that is piped into dplyr::mutate(). I believe it needs the data frame so that it can access the underlying metadata of the column being operated on.

EDIT: I was able to create a much simpler function that works based on some of the code from @Mark.

# make data
data <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
) %>% 
  # add value labels
  labelled::set_value_labels(
    x = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4),
    y = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4),
    z = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4)
  )

# write the function
make_dicho <- function(df = data, var) {
  var <- rlang::enexpr(var)
  if (!is.character(var)) {
    # convert to a sym() object and then use as_name to make it a string
    var <- rlang::as_name(rlang::ensym(var))
  }
  
  # convert the vector to a factor
  haven::as_factor(df[[var]]) %>% 
    # remove the first part of the factor
    stringr::str_extract("(?<=\\s).+") %>% 
    # make the first letter uppercase
    stringr::str_to_sentence()
}

# check that it outputs a vector
make_dicho(data, x)

# check that it works inside dplyr::mutate()
data %>% dplyr::mutate(new_x = make_dicho(., x))

# check that it works in dplyr::across()
data %>% mutate(
  across(
    c(x:z),
    \(var) make_dicho(., var),
    .names = "new_{col}"
  )
)

I am a social scientist who works with survey data a lot. Many of the variables are four-point agree-disagree Likert scales with response options "Strongly agree", "Somewhat agree", "Somewhat disagree", "Strongly disagree", but some are six-point scales. A consistent part of the data cleaning process is to convert these variables into dichotomous factors (meaning they have two response options, "Agree" and "Disagree"). Here is an example below where data is the data frame, x is the original variable with all four response options, and new_x is the dichotomized variable:

pacman::p_load(tidyverse, labelled, rlang)

data %>% 
  dplyr::mutate(
    new_x = dplyr::case_match(
      x,
      c(1:2) ~ "Agree",
      c(3:4) ~ "Disagree"
    )
  )

The issue is that I often have more than 30 variables that I have to do this with. I know that I can use across() to do this same data transformation over all 30 variables, but I have to repeat this every couple of weeks when we get new survey data back. Instead, I would like to have a function called something like make_dicho() that I can use inside of mutate() and across() so that I don't have to write the entire case_match() expression out every single time. Here is a successful attempt at building a rudimentary version:

# create sample data
data <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
)

data

# create the function where values of 1-2 are "Agree" and 3-4 are "Disagree"
make_dicho <- function(var) {
  dplyr::case_match(
    var,
    c(1:2) ~ "Agree",
    c(3:4) ~ "Disagree"
  )
}

# check to see if it worked
data %>% dplyr::mutate(new_x = make_dicho(x))

# success!

This function works, but it is very fragile as it relies on the survey designer and survey provider using four response options and coding the values in a very specific way. One way to avoid this is to leverage the underlying metadata which contains value labels that indicate what each value means. Since most of my data contains this metadata, I would like to use it to automatically decide which values should be recoded as "Agree" and which should be recoded as "Disagree". This complicates things significantly as I now need to add a new argument for the data frame. Here is what I have come up with so far:

# add value labels to the data

data <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
) %>% 
  # add value labels
  labelled::set_value_labels(
    x = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4),
    y = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4),
    z = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4)
  )

# write the new function
make_dicho <- function(df = NULL, var) {

  ## if var is a symbol convert it to a string
  # "Returns a naked expression of the variable"
  var <- rlang::enexpr(var)
  
  if (!is.character(var)) {
    # convert to a sym() object and then use as_name to make it a string
    var <- rlang::as_name(rlang::ensym(var))
  }

  # Since this is taking advantage of labelled data, it should be of class haven_labelled 
  if (class(df[[var]])[1] == "haven_labelled") {

    ### Set up vectors based on the underlying attribute
    
    # get the named vector
    labs <- attributes(df[[var]])$labels
    
    
    # flip the names
    labs <- setNames(names(labs), labs)
    
    
    # get the agree vector by removing the strings containing "disagree" or "Disagree"
    agree_vec <- labs[!str_detect(labs, pattern = "disagree|Disagree")]
    
    # now flip the vector back and make it numeric
    # enframe() converts named atomic vectors or lists to one- or two-column data frames.
    agree_vec <- tibble::enframe(agree_vec) %>%
      # put the "value" column at the beginning of the df
      dplyr::relocate(value) %>%
      # convert "name" to numeric
      dplyr::mutate(name = as.numeric(name)) %>%
      # deframe() converts two-column data frames to a named vector or list
      tibble::deframe()
    
    
    # get the disagree vector by keeping the strings containing "disagree" or "Disagree"
    disagree_vec <- labs[str_detect(labs, pattern = "disagree|Disagree")]
    
    # now flip the vector back and make it numeric
    # enframe() converts named atomic vectors or lists to one- or two-column data frames.
    disagree_vec <- tibble::enframe(disagree_vec) %>%
      # put the "value" column at the beginning of the df
      dplyr::relocate(value) %>%
      # convert "name" to numeric
      dplyr::mutate(name = as.numeric(name)) %>%
      # deframe() converts two-column data frames to a named vector or list
      tibble::deframe()
    
    
    
    ### now create the case_match function,
    # Adding in df[[var]] so that it knows which vector to use
    dplyr::case_match(
      df[[var]],
      agree_vec ~ "Agree",
      disagree_vec ~ "Disagree"
    ) 
    
    }

}

# test function
data %>% dplyr::mutate(new_x = make_dicho(x))

This fails and gives an error that says argument "var" is missing, with no default. However, if I add . inside make_dicho() it works. Like this:

data %>% dplyr::mutate(new_x = make_dicho(., x))

My first question is, how do I update my function so that it no longer requires the . at the beginning? Secondly, how do I get it to work in dplyr::across()? Here is the code I used for dplyr::across():

# make all three variables dichotomous factors with "new_" prefix
data %>% dplyr::mutate(
  dplyr::across(
    c(x:z),
    ~make_dicho(., .x),
    .names = "new_{col}"
  )
)

When I use across(), I get an error (the error message was posted as a screenshot). My guess is that it has something to do with the . inside of the make_dicho() call and with the call to df[[var]] found in the dplyr::case_match(). But I honestly have no idea and, while it feels like I am very close, for all I know this function could be all messed up.

In sum, I would like to create a function that automates the process of creating a new column with values "Agree" or "Disagree" based on the underlying metadata for the column undergoing the transformation. I would like this to be able to operate inside of both dplyr::across() and dplyr::mutate(). Moreover, it needs two arguments, one for the variable/column that will be recoded and one for the data frame that is piped into dplyr::mutate(). I believe it needs the data frame so that it can access the underlying metadata of the column being operated on.

Hopefully the request, while a bit complicated, is easy to understand. Thank you for any and all help!


Solution

  • You have posed a few different questions, so to break it down:

    > This [(data %>% mutate(new_x = make_dicho(x)))] fails and gives an error that says argument "var" is missing, with no default. However, if I add . inside make_dicho() it works

    This is because make_dicho() has two arguments: df, with the default value NULL, and var, with no default. R matches unnamed arguments by position, so a single argument is bound to the first parameter, df, and var is left missing, hence the error.
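
    The matching rule can be seen in isolation with a minimal base R sketch. Here f is a hypothetical stand-in that only shares make_dicho()'s signature; it is not the function from the question:

```r
# Hypothetical stand-in with the same signature as make_dicho(df = NULL, var)
f <- function(df = NULL, var) {
  var  # touching `var` forces it, which fails if it was never supplied
}

# A single unnamed argument is matched by position to `df`, so `var` is missing
msg <- tryCatch(f(1), error = function(e) conditionMessage(e))
print(msg)

# Naming the argument binds the value to `var` instead
named_res <- f(var = 1)
print(named_res)
```

    Naming the argument (var = x) is therefore another way to get past the error, although the function would then fall back to df = NULL and have no data frame to read the labels from.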

    > My first question is, how do I update my function so that it no longer requires the . at the beginning?

    There are a few different ways:

    1. Change make_dicho <- function(df = NULL, var) { to make_dicho <- function(var, df = data) {. This will not work if the data frame you are working with has a different name.
    2. Make it simpler by working on the column alone, e.g. something like this:
    data |> mutate(across(x:z, \(a) str_extract(names(val_labels(a))[a], "(?<=\\s).+"), .names = "new_{col}"))
    # or 
    make_dicho <- \(a) str_extract(names(val_labels(a))[a], "(?<=\\s).+")
    data |> mutate(across(x:z, make_dicho, .names = "new_{col}"))
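
    Note that names(val_labels(a))[a] looks labels up by position, which fits the sample data because the values run 1 to 4 in label order. As a hedged alternative, here is a sketch of a column-only function that classifies each value by the text of its label instead. The name make_dicho_labels and the "contains agree/disagree" rule are assumptions for illustration, not part of the question's code:

```r
library(dplyr)
library(labelled)

# Hypothetical column-only variant: reads the value labels from the vector itself
make_dicho_labels <- function(x) {
  labs <- labelled::val_labels(x)  # named vector, e.g. c(`Strongly agree` = 1, ...)
  if (is.null(labs)) {
    stop("`x` has no value labels", call. = FALSE)
  }

  # Classify each label by its text; "disagree" contains "agree",
  # so anything flagged as disagree is excluded from the agree set
  is_disagree <- grepl("disagree", names(labs), ignore.case = TRUE)
  is_agree <- grepl("agree", names(labs), ignore.case = TRUE) & !is_disagree

  # Compare plain numeric codes so the labelled class does not interfere
  codes <- as.vector(unclass(x))
  dplyr::case_match(
    codes,
    unname(labs[is_agree]) ~ "Agree",
    unname(labs[is_disagree]) ~ "Disagree"
  )
}

# Same sample data as in the question
agree_labels <- c(
  `Strongly agree` = 1, `Somewhat agree` = 2,
  `Somewhat disagree` = 3, `Strongly disagree` = 4
)
data <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
) |>
  labelled::set_value_labels(x = agree_labels, y = agree_labels, z = agree_labels)

# Works on one column in mutate() and on several in across(), with no df argument
single <- data |> mutate(new_x = make_dicho_labels(x))
result <- data |> mutate(across(x:z, make_dicho_labels, .names = "new_{.col}"))
print(result)
```

    Because the function only needs the column, the data frame argument disappears entirely, and a six-point scale should be handled the same way as long as every label contains "agree" or "disagree".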
    

    > Secondly, how do I get it to work in across()?[...] Here is the error image I am getting when I try using across(). My guess is that it has something to do with the . inside of the make_dicho() call and by the call to df[[var]] found in the case_match. But I honestly have no idea and, while it feels like I am very close, for all I know this function could be all messed up.

    Your assumption is right. In a formula-style anonymous function (~ ...), . means the same thing as .x: the current column, not the data frame piped into mutate(). So make_dicho(., .x) receives the column as both arguments. See:

    data |> mutate(across(everything(), ~ ., .names = "new_{col}"))
    

    Output:

    # A tibble: 4 × 6
      x                     y                     z          new_x   new_y   new_z  
      <dbl+lbl>             <dbl+lbl>             <dbl+lbl>  <dbl+l> <dbl+l> <dbl+l>
    1 3 [Somewhat disagree] 2 [Somewhat agree]    3 [Somewh… 3 [Som… 2 [Som… 3 [Som…
    2 4 [Strongly disagree] 4 [Strongly disagree] 2 [Somewh… 4 [Str… 4 [Str… 2 [Som…
    3 2 [Somewhat agree]    3 [Somewhat disagree] 1 [Strong… 2 [Som… 3 [Som… 1 [Str…
    4 1 [Strongly agree]    1 [Strongly agree]    4 [Strong… 1 [Str… 1 [Str… 4 [Str…
    

    The solution is to either define the function separately or use the base R \(x) lambda syntax (R 4.1 or later), in which . is not rebound to the lambda's argument, like so:

    data %>% mutate(
      across(
        x:z,
        \(a) make_dicho(., a),
        .names = "new_{col}"
      )
    )
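
    To see the difference between the two lambda styles directly, here is a small sketch that only asks what . is in each case. The toy data frame and the column names it creates are made up for illustration:

```r
library(dplyr)

toy <- tibble::tibble(x = c(3, 4), y = c(1, 2))

# Formula lambda: `.` is the current column, so it is never a data frame
formula_res <- toy %>%
  mutate(across(x:y, ~ is.data.frame(.), .names = "{.col}_dot_is_df"))

# Base R lambda: `.` is left alone, so it still resolves to the data frame
# that %>% piped into mutate()
lambda_res <- toy %>%
  mutate(across(x:y, \(a) is.data.frame(.), .names = "{.col}_dot_is_df"))

print(formula_res)
print(lambda_res)
```

    This relies on the magrittr pipe binding . to its left-hand side; the native |> pipe has no such binding, so the \(a) make_dicho(., a) pattern would need %>%.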