I have a table of data with studentIDs,TestingWindow, and BenchmarkCategories. Values for TestingWindow are "Start of Year" or "End of Year." Values for the TestingData are Urgent Intervention, Intervention, On Watch, At Benchmark, Above Benchmark.
Here's some sample data. Notice student 123 is missing a result of End of Year.
studentIDs | ScreeningPeriodWindowName <chr> | DistrictBenchmarkCategoryName <chr> |
---|---|---|
123 | StartofYear | Urgent Intervention |
456 | StartofYear | Intervention |
456 | EndofYear | At Benchmark |
789 | StartofYear | On Watch |
789 | EndofYear | Above Benchmark |
I want to display the movement of students between BenchmarkCategories between the start and end of year. I'd like a table to show this. The key is to show changes for individual students, rather than total counts. "How many students moved from Urgent Intervention to On Watch?"
Here's what the table should look like, with Column A the start of year categories and Row 1 the End of Year Categories. (middle cells blank but I think you see the pattern)
v start end> | Urgent Intervention | Intervention | On Watch | At Benchmark | Above Benchmark |
---|---|---|---|---|---|
Urgent Intervention | count of Urgent to Urgent | count of Urgent to Intervention | count of Urgent to On Watch | Count of Urgent to At Benchmark | count of Urgent to Above Benchmark |
Intervention | |||||
On Watch | |||||
At Benchmark | |||||
Above Benchmark | count of Above to Urgent | count of Above to Intervention | count of Above to On Watch | count of Above to At Benchmark | count of Above to Above |
In Excel pivot tables this is trivial (defining rows and columns, count IDs) but I am having a hard time wrapping my head around it in terms of grouping and summarizing (or pivot-tabling) in R. Thanks.
I have tried pivot_wider and pivot_longer, but each only gets me the row or column headings and one set of counts.
progress <- reading_data %>%
select(studentID,TestingWindow,BenchmarkCategories) %>%
group_by(TestingWindow,BenchmarkCategories) %>%
summarize(count=n_distinct(studentID)) %>%
pivot_wider(names_from=TestingWindow,values_from=count)
progress
You can do this using two pivot_wider()
calls:
pivot_wider()
to create separate columns for each value in TestingWindow e.g. StartofYear and EndofYearpivot_wider()
again so EndofYear values are pivoted to columnsNote that if you want all BenchmarkCategories to be included in the result, you'll need an extra step as detailed below. I have also included a larger example dataset to illustrate results.
First, with your sample data (with modified column names to match your code pipe column names):
library(dplyr)
library(tidyr)
reading_data <- structure(list(studentID = c(123L, 456L, 456L, 789L, 789L), TestingWindow = c("StartofYear",
"StartofYear", "EndofYear", "StartofYear", "EndofYear"), BenchmarkCategories = c("Urgent Intervention",
"Intervention", "At Benchmark", "On Watch", "Above Benchmark"
)), class = "data.frame", row.names = c(NA, -5L))
progress <- reading_data |>
pivot_wider(names_from = TestingWindow,
values_from = BenchmarkCategories) |>
count(StartofYear, EndofYear, .drop = FALSE) |>
pivot_wider(names_from = EndofYear,
values_from = n,
values_fill = 0) |>
rename(`v start end >` = StartofYear)
progress
# # A tibble: 3 × 4
# `v start end >` `At Benchmark` `Above Benchmark` `NA`
# <chr> <int> <int> <int>
# 1 Intervention 1 0 0
# 2 On Watch 0 1 0
# 3 Urgent Intervention 0 0 1
Pairs not represented are dropped so not all BenchmarkCategories are present. Also, the NA column counts the incomplete records e.g. studentID == 123.
If you want all BenchmarkCategories in the result:
# Create vector of all unique BenchmarkCategories
bmc <- unique(reading_data$BenchmarkCategories)
progress <- reading_data |>
pivot_wider(names_from = TestingWindow,
values_from = BenchmarkCategories) |>
count(StartofYear, EndofYear, .drop = FALSE) |>
complete(StartofYear = bmc,
EndofYear = bmc,
fill = list(n = 0)) |>
pivot_wider(names_from = EndofYear,
values_from = n,
values_fill = 0) |>
rename(`v start end >` = StartofYear)
progress
# # A tibble: 5 × 7
# StartofYear `Above Benchmark` `At Benchmark` Intervention `On Watch` `Urgent Intervention` `NA`
# <chr> <int> <int> <int> <int> <int> <int>
# 1 Above Benchmark 0 0 0 0 0 0
# 2 At Benchmark 0 0 0 0 0 0
# 3 Intervention 0 1 0 0 0 0
# 4 On Watch 1 0 0 0 0 0
# 5 Urgent Intervention 0 0 0 0 0 1
Example using larger example dataset:
set.seed(42)
reading_data <- data.frame(
studentID = c(123, rep(456:789, each = 2)),
TestingWindow = c("StartofYear", rep(c("StartofYear", "EndofYear"), 334)),
BenchmarkCategories = sample(c("Urgent Intervention","Intervention",
"At Benchmark", "On Watch", "Above Benchmark"),
669, replace = TRUE))
progress <- reading_data |>
pivot_wider(names_from = TestingWindow,
values_from = BenchmarkCategories) |>
count(StartofYear, EndofYear, .drop = FALSE) |>
# complete(StartofYear = bmc,
# EndofYear = bmc,
# fill = list(n = 0)) |>
pivot_wider(names_from = EndofYear,
values_from = n,
values_fill = 0) |>
rename(`v start end >` = StartofYear)
progress
# # A tibble: 5 × 7
# `v start end >` `Above Benchmark` `At Benchmark` Intervention `On Watch` `Urgent Intervention` `NA`
# <chr> <int> <int> <int> <int> <int> <int>
# 1 Above Benchmark 12 8 13 13 8 0
# 2 At Benchmark 14 9 11 12 15 0
# 3 Intervention 18 17 18 15 20 0
# 4 On Watch 14 10 14 7 15 0
# 5 Urgent Intervention 11 21 14 17 8 1