rr-factormutate

How to create a new column based on information from other factor columns in R?


I have for some time now tried to find an effective solution for how to create a new column based on information from other factor columns in R. I have come up with one solution that works, but is quite tedious, and one which is far less tedious, but only partly works. I usually work with survey data containing NAs, which is an important part of the problem. And please excuse me if I do something wrong when presenting this question, as this is both my first post here at SO and my first Reprex. The question is similar to others that have been asked here at SO, but I have not been able to tweak any existing solutions to solve my problem.

[EDITED AFTER FIRST RESPONSE DUE TO BEING UNCLEAR] What I want to obtain, is a new column in my tibble that only returns NA if all columns used for generating the new column are NA, while returning "yes" if at least one column is "yes," and "no" if no column contains a "yes."

Here's a Reprex to illustrate my problem:

library(tidyverse)

# create the table
df <- tibble(
  othervar_2007 = c(NA, "yes", "no", "no", "no"),
  morv1_2007 = c(NA, "yes", "no", "no", NA),
  morv2_2007 = c(NA, "yes", NA, "no", "no"),
  morv3_2007 = c(NA, "no", "yes", "no", "no"),
  morv4_2007 = c(NA, "no", "no", "no", "no"),
  morv_othervar = c(NA, "yes", "no", "no", "no")
)

# show table
df
#> # A tibble: 5 × 6
#>   othervar_2007 morv1_2007 morv2_2007 morv3_2007 morv4_2007 morv_othervar
#>   <chr>         <chr>      <chr>      <chr>      <chr>      <chr>        
#> 1 <NA>          <NA>       <NA>       <NA>       <NA>       <NA>         
#> 2 yes           yes        yes        no         no         yes          
#> 3 no            no         <NA>       yes        no         no           
#> 4 no            no         no         no         no         no           
#> 5 no            <NA>       no         no         no         no

# alternative 1
df <- df %>% 
  mutate(newvar1 = case_when(
    morv1_2007 == "yes" | morv2_2007 == "yes" | morv3_2007 == "yes" | morv4_2007 == "yes" ~ "Yes",
    morv1_2007 == "no" | morv2_2007 == "no" | morv3_2007 == "no" | morv4_2007 == "no" ~ "No")
  )

df %>% 
  count(newvar1)
#> # A tibble: 3 × 2
#>   newvar1     n
#>   <chr>   <int>
#> 1 No          2
#> 2 Yes         2
#> 3 <NA>        1

# alternative 2
df <- df %>%
  rowwise %>%
  mutate(newvar2 = if_else(rowSums(across(c(morv1_2007, morv2_2007, morv3_2007, morv4_2007), ~ .x == "yes"))>0, "Yes", "No"))

df %>% 
  count(newvar2)
#> # A tibble: 3 × 2
#> # Rowwise: 
#>   newvar2     n
#>   <chr>   <int>
#> 1 No          1
#> 2 Yes         1
#> 3 <NA>        3
Created on 2023-11-09 with reprex v2.0.2

The tibble contains four columns which should be used for mutating the new column, as well as two mock columns which should be retained as they are. The columns are named to illustrate selection problems of relevant columns as well.

The solution labelled alternative 1 works, but is quite tedious when the number of columns you want to obtain information from gets large. I have also not been able to find a good solution for selecting columns, e.g., by using starts_with and ends_with.

The solution labelled alternative 2, however, returns NA if at least one column is NA. And when adding na.rm = TRUE, it returns "No" for the row containing only NAs. The selection issue applies here as well.

Hopefully, someone more experienced in R and coding can help me along here.

EDIT FOR FOLLOW-UP TIDYVERSE SOLUTION

df <- df %>%
  rowwise %>%
  mutate(newvar3 = if(all(is.na(c_across(matches("^morv.*2007$"))))) {
NA_character_
} else if ("yes" %in% c_across("^morv.*2007$"))) {
"yes"
} else {
"no"})

This returns what I am after, but is quite slow. The code suggested under here is way faster.


Solution

  • Based on

    "Here's what I want to obtain [...]: A new column in my tibble that only returns NA if all columns used for generating the new column are NA, while returning "yes" if at least one "yes," and "no" if no columns contains a "yes.",

    Note, I have changed df[1, 1] to "no" by df[1, 1] <- "no" for testing.

    I suggest the following approach:

    df$newvar <- 
      apply(X = df[, grepl("morv\\d{1}", colnames(df))], 
            MARGIN = 1L, 
            FUN = \(x) ifelse(
              test = all(is.na(x)), 
              yes = NA, 
              no = ifelse(
                test = any(x == "yes", na.rm = TRUE), 
                yes = "yes", 
                no = "no")
              )
            )  
    

    which gives

    > df
    # A tibble: 5 × 7
      othervar_2007 morv1_2007 morv2_2007 morv3_2007 morv4_2007 morv_othervar newvar
      <chr>         <chr>      <chr>      <chr>      <chr>      <chr>         <chr> 
    1 no            NA         NA         NA         NA         NA            NA    
    2 yes           yes        yes        no         no         yes           yes   
    3 no            no         NA         yes        no         no            yes   
    4 no            no         no         no         no         no            no    
    5 no            NA         no         no         no         no            no