I have a dataset that has course completions from a variety of subject areas for a very large number of students. Each row of data is organized as follows:
Student ID | Gender | City | Course | ETC. |
---|---|---|---|---|
10102 | Male | City 1 | Science | |
10304 | Female | City 2 | Science | |
10304 | Female | City 2 | Math | |
10506 | Male | City 3 | English | |
10506 | Male | City 3 | French | |
10506 | Male | City 3 | Science |
I would like to reorganize the dataset to have a single row of data for each unique student ID and then have all of the courses that student took added to a list variable for that student.
I've used R for this dataset as it's very large (~1.5 million rows) and it seems to be effective for that whereas Excel or SPSS are not.
I've used the filter()
function for a few other things, but I don't think it'll work in this case because I don't want duplicates of each student ID. Just one that has each unique course code they have taken listed inside of it.
We can use dplyr
here along with group_by
:
library(dplyr)
df_out <- df %>%
group_by(`Student ID`) %>%
summarise(Course = list(Course), .groups = 'drop')
df_out
# A Tibble: 3 × 2
`Student ID` Course
<dbl> <list>
1 10102 <chr [1]>
2 10304 <chr [2]>
3 10506 <chr [3]>