TL;DR: I would like to create a function that automates the process of creating a new column with values "Agree" or "Disagree" based on the underlying metadata (value labels) for the column undergoing the transformation. I would like this to be able to operate inside of both dplyr::across() and dplyr::mutate(). Moreover, it needs two arguments, one for the variable/column that will be recoded and one for the data frame that is piped into dplyr::mutate(). I believe it needs the data frame so that it can access the underlying metadata of the column being operated on.
EDIT: I was able to create a much simpler function that works based on some of the code from @Mark.
# make data
data <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
) %>%
  # add value labels
  labelled::set_value_labels(
    x = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4),
    y = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4),
    z = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4)
  )
# write the function
make_dicho <- function(df = data, var) {
  var <- rlang::enexpr(var)
  if (!is.character(var)) {
    # convert to a sym() object and then use as_name to make it a string
    var <- rlang::as_name(rlang::ensym(var))
  }
  # convert the vector to a factor
  haven::as_factor(df[[var]]) %>%
    # remove the first part of the factor
    stringr::str_extract("(?<=\\s).+") %>%
    # make the first letter uppercase
    stringr::str_to_sentence()
}
# check that it outputs a vector
make_dicho(data, x)
# check that it works inside dplyr::mutate()
data %>% dplyr::mutate(new_x = make_dicho(., x))
# check that it works in dplyr::across()
data %>% mutate(
  across(
    c(x:z),
    \(var) make_dicho(., var),
    .names = "new_{col}"
  )
)
I am a social scientist who works with survey data a lot. Many of the variables are four-point agree-disagree Likert scales with response options "Strongly agree", "Somewhat agree", "Somewhat disagree", "Strongly disagree", but sometimes they are six-point scales. A consistent part of the data cleaning process is to convert these variables into dichotomous factors (meaning they have two response options, "Agree" and "Disagree"). Here is an example below where data is the data frame, x is the original variable with all four response options, and new_x is the dichotomized variable:
pacman::p_load(tidyverse, labelled, rlang)
data %>%
  dplyr::mutate(
    new_x = dplyr::case_match(
      x,
      c(1:2) ~ "Agree",
      c(3:4) ~ "Disagree"
    )
  )
The issue is that I often have 30+ variables that I have to do this with. I know that I can use across() to do this same data transformation over all 30 variables, but I have to repeat this every couple of weeks when we get new survey data back. Instead, I would like to have a function called something like make_dicho() that I can use inside of mutate() and across() so that I don't have to write the entire case_match() expression out every single time. Here is a successful attempt at building a rudimentary version:
# create sample data
data <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
)
data
# create the function where values of 1-2 are "Agree" and 3-4 are "Disagree"
make_dicho <- function(var) {
  dplyr::case_match(
    var,
    c(1:2) ~ "Agree",
    c(3:4) ~ "Disagree"
  )
}
# check to see if it worked
data %>% dplyr::mutate(new_x = make_dicho(x))
# success!
This function works, but it is very fragile as it relies on the survey designer and survey provider using four response options and coding the values in a very specific way. One way to avoid this is to leverage the underlying metadata which contains value labels that indicate what each value means. Since most of my data contains this metadata, I would like to use it to automatically decide which values should be recoded as "Agree" and which should be recoded as "Disagree". This complicates things significantly as I now need to add a new argument for the data frame. Here is what I have come up with so far:
# add value labels to the data
data <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
) %>%
  # add value labels
  labelled::set_value_labels(
    x = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4),
    y = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4),
    z = c(`Strongly agree` = 1, `Somewhat agree` = 2, `Somewhat disagree` = 3, `Strongly disagree` = 4)
  )
# write the new function
make_dicho <- function(df = NULL, var) {
  ## if var is a symbol convert it to a string
  # "Returns a naked expression of the variable"
  var <- rlang::enexpr(var)
  if (!is.character(var)) {
    # convert to a sym() object and then use as_name to make it a string
    var <- rlang::as_name(rlang::ensym(var))
  }
  # Since this is taking advantage of labelled data, it should be of class haven_labelled
  if (inherits(df[[var]], "haven_labelled")) {
    ### Set up vectors based on the underlying attribute
    # get the named vector
    labs <- attributes(df[[var]])$labels
    # flip the names
    labs <- setNames(names(labs), labs)
    # get the agree vector by removing the strings containing "disagree" or "Disagree"
    agree_vec <- labs[!stringr::str_detect(labs, pattern = "disagree|Disagree")]
    # now flip the vector back and make it numeric
    # enframe() converts named atomic vectors or lists to one- or two-column data frames.
    agree_vec <- tibble::enframe(agree_vec) %>%
      # put the "value" column at the beginning of the df
      dplyr::relocate(value) %>%
      # convert "name" to numeric
      dplyr::mutate(name = as.numeric(name)) %>%
      # deframe() converts two-column data frames to a named vector or list
      tibble::deframe()
    # get the disagree vector by keeping the strings containing "disagree" or "Disagree"
    disagree_vec <- labs[stringr::str_detect(labs, pattern = "disagree|Disagree")]
    # now flip the vector back and make it numeric
    # enframe() converts named atomic vectors or lists to one- or two-column data frames.
    disagree_vec <- tibble::enframe(disagree_vec) %>%
      # put the "value" column at the beginning of the df
      dplyr::relocate(value) %>%
      # convert "name" to numeric
      dplyr::mutate(name = as.numeric(name)) %>%
      # deframe() converts two-column data frames to a named vector or list
      tibble::deframe()
    ### now create the case_match function,
    # Adding in df[[var]] so that it knows which vector to use
    dplyr::case_match(
      df[[var]],
      agree_vec ~ "Agree",
      disagree_vec ~ "Disagree"
    )
  }
}
# test function
data %>% dplyr::mutate(new_x = make_dicho(x))
This fails and gives an error that says argument "var" is missing, with no default. However, if I add . inside make_dicho() it works. Like this:
data %>% dplyr::mutate(new_x = make_dicho(., x))
My first question is, how do I update my function so that it no longer requires the . at the beginning? Secondly, how do I get it to work in dplyr::across()? Here is the code I used for dplyr::across():
# make all three variables dichotomous factors with "new_" prefix
data %>% dplyr::mutate(
  dplyr::across(
    c(x:z),
    ~make_dicho(., .x),
    .names = "new_{col}"
  )
)
Here is the error image I am getting when I try using across(). My guess is that it has something to do with the . inside of the make_dicho() call and by the call to df[[var]] found in the dplyr::case_match(). But I honestly have no idea and, while it feels like I am very close, for all I know this function could be all messed up.
In sum, I would like to create a function that automates the process of creating a new column with values "Agree" or "Disagree" based on the underlying metadata for the column undergoing the transformation. I would like this to be able to operate inside of both dplyr::across() and dplyr::mutate(). Moreover, it needs two arguments, one for the variable/column that will be recoded and one for the data frame that is piped into dplyr::mutate(). I believe it needs the data frame so that it can access the underlying metadata of the column being operated on.
Hopefully the request, while a bit complicated, is easy to understand. Thank you for any and all help!
You have posed a few different questions, so to break them down:
This [(data %>% mutate(new_x = make_dicho(x)))] fails and gives an error that says argument "var" is missing, with no default. However, if I add . inside make_dicho() it works
This is because make_dicho has two arguments: df, with the default value NULL, and var. If you give only one unnamed argument, R matches it to the first parameter, df, so var is left without a value, hence the error.
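A minimal sketch of that argument matching, using a hypothetical stand-in function f() (not part of the original code) that has the same signature as make_dicho() and only reports which arguments were supplied:

```r
# Hypothetical stand-in with the same signature as make_dicho()
f <- function(df = NULL, var) {
  c(df_supplied = !missing(df), var_supplied = !missing(var))
}

# A single unnamed argument is matched to the first parameter, df
res_positional <- f(1)

# Naming the argument binds it to var instead
res_named <- f(var = 1)

print(res_positional)
print(res_named)
```

In the first call var is never supplied, which is the situation that triggers the "missing, with no default" error as soon as the function body touches var.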
My first question is, how do I update my function so that it no longer requires the . at the beginning?
There are a few different ways:
Change make_dicho <- function(df = NULL, var) { to make_dicho <- function(var, df = data) {. This will obviously not work if the data frame you're working with changes name.
Alternatively, skip the df argument and work on the column itself:
data |> mutate(across(x:z, \(a) str_extract(names(val_labels(a))[a], "(?<=\\s).+"), .names = "new_{col}"))
# or
make_dicho <- \(a) str_extract(names(val_labels(a))[a], "(?<=\\s).+")
data |> mutate(across(x:z, make_dicho, .names = "new_{col}"))
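The one-liner indexes the labels by position, so it assumes the codes run 1, 2, 3, ... in the same order as the labels, and it returns the lowercase remainder of each label. If that assumption might not hold, a slightly more defensive sketch could match on the codes stored in the value labels instead. The name make_dicho_vec() is hypothetical, and the rule "any label containing 'disagree' is Disagree, any other label containing 'agree' is Agree" is an assumption about how the scales are worded:

```r
library(dplyr)
library(labelled)

# Same sample data as in the question
agree_labels <- c(
  `Strongly agree` = 1, `Somewhat agree` = 2,
  `Somewhat disagree` = 3, `Strongly disagree` = 4
)
data <- tibble::tribble(
  ~x, ~y, ~z,
  3, 2, 3,
  4, 4, 2,
  2, 3, 1,
  1, 1, 4
) %>%
  labelled::set_value_labels(x = agree_labels, y = agree_labels, z = agree_labels)

# Hypothetical helper: takes only the column, so no data frame argument is needed
make_dicho_vec <- function(x) {
  # named vector: names are the labels, values are the codes
  labs <- labelled::val_labels(x)
  if (is.null(labs)) {
    stop("`x` has no value labels to dichotomize on.")
  }
  # "disagree" contains "agree", so flag the disagree labels first
  is_disagree <- grepl("disagree", names(labs), ignore.case = TRUE)
  is_agree <- grepl("agree", names(labs), ignore.case = TRUE) & !is_disagree

  codes <- unclass(x) # underlying numeric codes
  out <- rep(NA_character_, length(codes))
  out[codes %in% labs[is_agree]] <- "Agree"
  out[codes %in% labs[is_disagree]] <- "Disagree"
  out
}

result <- data %>%
  dplyr::mutate(dplyr::across(x:z, make_dicho_vec, .names = "new_{.col}"))
print(result)
```

Because the split is read from each column's own labels, the same helper should also cope with six-point scales, as long as every label contains "agree" or "disagree". A neutral label such as "Neither agree nor disagree" would be classed as "Disagree" under this rule, so it would need separate handling.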
Secondly, how do I get it to work in across()?[...] Here is the error image I am getting when I try using across(). My guess is that it has something to do with the . inside of the make_dicho() call and by the call to df[[var]] found in the case_match. But I honestly have no idea and, while it feels like I am very close, for all I know this function could be all messed up.
Your assumption is right: the problem is that in that format of anonymous function (i.e. ~ .x etc.), you can use . instead of .x, so inside the lambda . refers to the current column rather than to the piped data frame. See:
data |> mutate(across(everything(), ~ ., .names = "new_{col}"))
Output:
# A tibble: 4 × 6
x y z new_x new_y new_z
<dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+l> <dbl+l> <dbl+l>
1 3 [Somewhat disagree] 2 [Somewhat agree] 3 [Somewh… 3 [Som… 2 [Som… 3 [Som…
2 4 [Strongly disagree] 4 [Strongly disagree] 2 [Somewh… 4 [Str… 4 [Str… 2 [Som…
3 2 [Somewhat agree] 3 [Somewhat disagree] 1 [Strong… 2 [Som… 3 [Som… 1 [Str…
4 1 [Strongly agree] 1 [Strongly agree] 4 [Strong… 1 [Str… 1 [Str… 4 [Str…
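The same behaviour can be checked outside of across(). This is a minimal sketch that uses rlang::as_function() to build a function from a formula, on the assumption that across() treats formula lambdas in the same way; the name lambda is only for illustration:

```r
library(rlang)

# Turn a formula into a function; inside it, . and .x are both the first argument
lambda <- rlang::as_function(~ identical(., .x))

# Call it on a plain vector standing in for a column
same <- lambda(c(3, 4, 2, 1))
print(same)
```

So in ~make_dicho(., .x) both arguments receive the column, and the data frame from the pipe never reaches the function.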
The solution is to either define the function separately, or use the new \(x) x syntax, like so:
data %>% mutate(
  across(
    x:z,
    \(a) make_dicho(., a),
    .names = "new_{col}"
  )
)