I'm working with pre/post assessment data. Each question has a score out of 5 and is collected at two timepoints for each student.
library(tidyverse)
name <- c("Student 1", "Student 2", "Student 3")
Q1_1_pre <- c(5, 1, 4)
Q1_1_post <- c(5, 2, 5)
Q1_2_pre <- c(4, 4, 2)
Q1_2_post <- c(5, 3, 5)
my_df <- data.frame(name, Q1_1_pre, Q1_1_post, Q1_2_pre, Q1_2_post)
My goal is to add a new boolean column for each question that indicates whether the student's score on that question improved from pre to post. I originally implemented it as shown below, but that clearly doesn't scale to the full dataset of 50+ questions.
pre_post_imp <- my_df %>%
mutate("Q1_1_imp" = `Q1_1_post` > `Q1_1_pre`,
"Q1_2_imp" = `Q1_2_post` > `Q1_2_pre`)
I thought I could use map(), or mutate() with across(), to apply the function to all questions. But I can't even get the function to work:
q_names <- c("Q1_1", "Q1_2") #example of what the vector would look like
greater_than <- function(df, name){
pre <- paste0(name, "_pre")
post <- paste0(name, "_post")
df[[post]] > df[[pre]]
}
#testing if the function works - it doesn't
pre_post <- my_df %>%
mutate("Q1_1_imp" = greater_than(., "Q1_1"),
"Q1_2_imp" = greater_than(., "Q1_2"),
"Q1_3_imp" = greater_than(., "Q1_3"))
I think the issue is that a (nonexistent) column is getting passed as the second argument while I want a string to be passed, but I can't figure out how to fix this.
I know I could use a for loop as well. Does that make more sense than a tidyverse solution in this case?
I think the simplest and most legible approach might be to reshape your data to put each question for each student into separate rows, and each timepoint into columns, since those two values are intrinsically linked in your analysis. (At least at this stage.)
Once the information currently encoded in column names is converted into variables, further analysis will be simpler. In your example, |> mutate(imp = post > pre) would suffice.
my_df |>
pivot_longer(-name, # pivot all columns except name
names_to = c("Q", "num", ".value"), # name = Q + num + [label: pre/post]
names_sep = "_") # separate at underscores
# A tibble: 6 × 5
# name Q num pre post
# <chr> <chr> <chr> <dbl> <dbl>
#1 Student 1 Q1 1 5 5
#2 Student 1 Q1 2 4 5
#3 Student 2 Q1 1 1 2
#4 Student 2 Q1 2 4 3
#5 Student 3 Q1 1 4 5
#6 Student 3 Q1 2 2 5
In this format, we could also very simply analyze what the average change was, or verify how many instances had both a pre and post value, or which students or questions were missing one or the other, etc. In this context, the pre/post observations are paired so closely that it's simpler to think of them as one meta-observation.
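As a rough sketch of those follow-up checks, the long data can be summarised per question. The names long_df, question_summary, incomplete, and the summary column names are only illustrative choices, not anything from the question:

```r
library(dplyr)
library(tidyr)

# Sample data from the question
my_df <- data.frame(
  name      = c("Student 1", "Student 2", "Student 3"),
  Q1_1_pre  = c(5, 1, 4),
  Q1_1_post = c(5, 2, 5),
  Q1_2_pre  = c(4, 4, 2),
  Q1_2_post = c(5, 3, 5)
)

# One row per student and question, with pre and post as columns
long_df <- my_df |>
  pivot_longer(-name,
               names_to = c("Q", "num", ".value"),
               names_sep = "_")

# Per-question summary (illustrative column names)
question_summary <- long_df |>
  group_by(Q, num) |>
  summarise(
    n_paired    = sum(!is.na(pre) & !is.na(post)),  # rows with both scores
    mean_change = mean(post - pre, na.rm = TRUE),   # average pre-to-post change
    n_improved  = sum(post > pre, na.rm = TRUE),    # students who improved
    .groups = "drop"
  )

# Student/question pairs that are missing one of the two scores
incomplete <- long_df |>
  filter(is.na(pre) | is.na(post))
```

With the three-student sample there are no missing scores, so incomplete should come back empty; on the full dataset it would list the pairs to investigate.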
If the wide format is needed for reporting or further analysis, you could pivot wider again:
my_df |>
pivot_longer(-name, # pivot all columns except name
names_to = c("Q", "num", ".value"), # name = Q + num + [label: pre/post]
names_sep = "_") |> # separate at underscores
mutate(imp = post > pre) |>
pivot_wider(names_from = c(Q, num),
values_from = pre:imp,
names_glue = "{Q}_{num}_{.value}", names_vary = "slowest")
# A tibble: 3 × 7
name Q1_1_pre Q1_1_post Q1_1_imp Q1_2_pre Q1_2_post Q1_2_imp
<chr> <dbl> <dbl> <lgl> <dbl> <dbl> <lgl>
1 Student 1 5 5 FALSE 4 5 TRUE
2 Student 2 1 2 TRUE 4 3 FALSE
3 Student 3 4 5 TRUE 2 5 TRUE
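If you would rather stay in wide format the whole way, here is a minimal base R sketch built on the q_names idea from the question. The helper name improved and the result name wide_imp are hypothetical, and it assumes every question has both a _pre and a _post column. As far as I can tell, the helper in the question is fine for columns that exist; the test call probably fails because "Q1_3" is not in the sample data, so df[["Q1_3_post"]] is NULL and the comparison returns a zero-length vector that mutate() rejects.

```r
# Sample data from the question
my_df <- data.frame(
  name      = c("Student 1", "Student 2", "Student 3"),
  Q1_1_pre  = c(5, 1, 4),
  Q1_1_post = c(5, 2, 5),
  Q1_2_pre  = c(4, 4, 2),
  Q1_2_post = c(5, 3, 5)
)

# Derive the question stems from the column names instead of typing them out
# (assumes each stem has matching _pre and _post columns)
q_names <- sub("_pre$", "", grep("_pre$", names(my_df), value = TRUE))

# Hypothetical helper: TRUE where the post score exceeds the pre score
improved <- function(df, q) {
  df[[paste0(q, "_post")]] > df[[paste0(q, "_pre")]]
}

# Add one logical *_imp column per question
wide_imp <- my_df
wide_imp[paste0(q_names, "_imp")] <- lapply(q_names, improved, df = my_df)
```

This appends the _imp columns at the end rather than next to each question, so the pivot approach is still the easier route if column order matters.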
Or, if you want to plot the data with ggplot2, it could help to pivot longer once more so that each measurement (pre or post) is in its own row:
my_df |>
pivot_longer(-name, # pivot all columns except name
names_to = c("Q", "num", ".value"), # name = Q + num + [label: pre/post]
names_sep = "_") |> # separate at underscores
mutate(imp = post > pre) |>
pivot_longer(pre:post, names_to = "time") |>
mutate(time = fct_inorder(time)) |> # keep pre before post on the x axis
ggplot(aes(time, value, color = imp,
group = interaction(name, Q, num))) +
geom_line() +
facet_wrap(~name)