r dplyr

How to pass string into function from within mutate


I'm working with pre/post assessment data. Each question has a score out of 5 and is collected at two timepoints for each student.

library(tidyverse)
name <- c("Student 1", "Student 2", "Student 3")
Q1_1_pre <- c(5, 1, 4)
Q1_1_post <- c(5, 2, 5)
Q1_2_pre <- c(4, 4, 2)
Q1_2_post <- c(5, 3, 5)
my_df <- data.frame(name, Q1_1_pre, Q1_1_post, Q1_2_pre, Q1_2_post)

My goal is to add a new boolean column for each question that indicates whether the student's score on that question improved from pre to post. I originally implemented it as below, but that obviously won't scale to the full dataset of 50+ questions.

pre_post_imp <- my_df %>%
  mutate("Q1_1_imp" = `Q1_1_post` > `Q1_1_pre`,
         "Q1_2_imp" = `Q1_2_post` > `Q1_2_pre`)

I thought I could:

  1. make a vector of the base question names
  2. make a function that does the pre/post comparison, and then
  3. use something like map() or mutate() with across() to apply the function for all questions.

But I can't even get the function to work:

q_names <- c("Q1_1", "Q1_2") #example of what the vector would look like

greater_than <- function(df, name){  
  pre <- paste0(name, "_pre")
  post <- paste0(name, "_post")
  df[[post]] > df[[pre]]
}

#testing if the function works - it doesn't
pre_post <- my_df %>%   
  mutate("Q1_1_imp" = greater_than(., "Q1_1"),
         "Q1_2_imp" = greater_than(., "Q1_2"),
         "Q1_3_imp" = greater_than(., "Q1_3"))

I think the issue is that a (nonexistent) column is getting passed as the second argument while I want a string to be passed, but I can't figure out how to fix this.

I know I could use a for loop as well - does that make more sense than a tidyverse solution in this case?


Solution

  • I think the simplest and most legible approach might be to reshape your data to put each question for each student into separate rows, and each timepoint into columns, since those two values are intrinsically linked in your analysis. (At least at this stage.)

    Once the information currently encoded in column names is converted into variables, further analysis will be simpler. In your example, |> mutate(imp = post > pre) would suffice.

    my_df |>
      pivot_longer(-name,             # pivot all columns except name
                   names_to = c("Q", "num", ".value"), # name = Q + num + [label: pre/post]
                   names_sep = "_")   # separate at underscores 
    
    ## A tibble: 6 × 5
    #  name      Q     num     pre  post
    #  <chr>     <chr> <chr> <dbl> <dbl>
    #1 Student 1 Q1    1         5     5
    #2 Student 1 Q1    2         4     5
    #3 Student 2 Q1    1         1     2
    #4 Student 2 Q1    2         4     3
    #5 Student 3 Q1    1         4     5
    #6 Student 3 Q1    2         2     5
    

    In this format, we could also very simply analyze what the average change was, or verify how many instances had both a pre and post value, or which students or questions were missing one or the other, etc. In this context, the pre/post observations are paired so closely that it's simpler to think of them as one meta-observation.
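
    As a rough sketch of those follow-up checks, assuming the same my_df as in the question (rebuilt here so the block runs on its own). The names long_df and question_summary are only illustrative:

    ```r
    library(dplyr)
    library(tidyr)

    # Sample data from the question
    my_df <- data.frame(
      name      = c("Student 1", "Student 2", "Student 3"),
      Q1_1_pre  = c(5, 1, 4), Q1_1_post = c(5, 2, 5),
      Q1_2_pre  = c(4, 4, 2), Q1_2_post = c(5, 3, 5)
    )

    # One row per student and question, with pre and post side by side
    long_df <- my_df |>
      pivot_longer(-name,
                   names_to = c("Q", "num", ".value"),
                   names_sep = "_")

    # Per-question summary: average change, complete pairs, number improved
    question_summary <- long_df |>
      group_by(Q, num) |>
      summarise(
        mean_change = mean(post - pre, na.rm = TRUE),
        n_paired    = sum(!is.na(pre) & !is.na(post)),
        n_improved  = sum(post > pre, na.rm = TRUE),
        .groups     = "drop"
      )
    ```

    Grouping by name instead of Q and num would give the same summary per student.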


    If the wide format is needed for reporting or further analysis, you could pivot wider again:

    my_df |>
      pivot_longer(-name,             # pivot all columns except name
                   names_to = c("Q", "num", ".value"), # name = Q + num + [label: pre/post]
                   names_sep = "_") |>  # separate at underscores 
      mutate(imp = post > pre) |>
      pivot_wider(names_from = c(Q, num),
                  values_from = pre:imp, 
                  names_glue = "{Q}_{num}_{.value}", names_vary = "slowest")
    
    # A tibble: 3 × 7
    #   name      Q1_1_pre Q1_1_post Q1_1_imp Q1_2_pre Q1_2_post Q1_2_imp
    #   <chr>        <dbl>     <dbl> <lgl>       <dbl>     <dbl> <lgl>
    # 1 Student 1        5         5 FALSE           4         5 TRUE
    # 2 Student 2        1         2 TRUE            4         3 FALSE
    # 3 Student 3        4         5 TRUE            2         5 TRUE
    

    Or if you want to plot the data with ggplot2, it could help to pivot longer once more, so that each timepoint (pre or post) of each question is in its own row:

    my_df |>
      pivot_longer(-name,             # pivot all columns except name
                   names_to = c("Q", "num", ".value"), # name = Q + num + [label: pre/post]
                   names_sep = "_") |>  # separate at underscores 
      mutate(imp = post > pre) |>
      pivot_longer(pre:post, names_to = "time") |>
      mutate(time = factor(time) |> fct_inorder()) |>
      ggplot(aes(time, value, color = imp,
                 group = interaction(name, Q, num))) +
      geom_line() +
      facet_wrap(~name)
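
    If you would rather stay in the wide format, here is a minimal sketch of the function-over-names idea from the question, using purrr::map(). It assumes q_names lists only questions whose _pre and _post columns actually exist. The test in the question appears to fail because "Q1_3" has no such columns, so df[[...]] returns NULL and the comparison has length zero. The names imp_cols and wide_imp are hypothetical:

    ```r
    library(dplyr)
    library(purrr)

    # Sample data from the question
    my_df <- data.frame(
      name      = c("Student 1", "Student 2", "Student 3"),
      Q1_1_pre  = c(5, 1, 4), Q1_1_post = c(5, 2, 5),
      Q1_2_pre  = c(4, 4, 2), Q1_2_post = c(5, 3, 5)
    )
    q_names <- c("Q1_1", "Q1_2")  # only questions present in my_df

    # Same comparison as the question's greater_than()
    greater_than <- function(df, name) {
      df[[paste0(name, "_post")]] > df[[paste0(name, "_pre")]]
    }

    # One logical vector per question, named Q1_1_imp, Q1_2_imp, ...
    imp_cols <- q_names |>
      set_names(paste0(q_names, "_imp")) |>
      map(\(q) greater_than(my_df, q))

    # Append the new columns to the original wide data
    wide_imp <- bind_cols(my_df, as_tibble(imp_cols))
    ```

    This keeps the original column layout, at the cost of leaving the question and timepoint encoded in the column names.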
    
