rdplyr

How to reorganize data based on unique cases of a variable?


I have a dataset that has course completions from a variety of subject areas for a very large number of students. Each row of data is organized as follows:

Student ID Gender City Course ETC.
10102 Male City 1 Science
10304 Female City 2 Science
10304 Female City 2 Math
10506 Male City 3 English
10506 Male City 3 French
10506 Male City 3 Science

I would like to reorganize the dataset to have a single row of data for each unique student ID and then have all of the courses that student took added to a list variable for that student.

I've used R for this dataset as it's very large (~1.5 million rows) and it seems to be effective for that whereas Excel or SPSS are not.

I've used the filter() function for a few other things, but I don't think it'll work in this case because I don't want duplicates of each student ID. Just one that has each unique course code they have taken listed inside of it.


Solution

  • We can use dplyr here along with group_by:

    library(dplyr)
    
    df_out <- df %>%
        group_by(`Student ID`) %>%
        summarise(Course = list(Course), .groups = 'drop')
    
    df_out
    
    # A Tibble: 3 × 2
      `Student ID` Course   
             <dbl> <list>   
    1        10102 <chr [1]>
    2        10304 <chr [2]>
    3        10506 <chr [3]>