rdataframemerger-collapse

Collapsing rows into lists for multiple columns based on identified duplicates in a group of other columns


I have a large dataset that is very similar to the following dummy dataset:

df = data.frame(coursecode = c("WBPH001","WBPH001","WBPH001","WBPH058","WBAS007"),
                 coursename = c("Mechanics","Mechanics","Mechanics", "Calculus 2","Introduction"),
                 courseurl = c("url1","url1","url1","url2","url3"),
                 programme_faculty = c("FSE","FSE","FSE", "FSE", "FSE"),
                 programme_name = c( "Mat","Bio","Ast","Ast","Ast"),
                 programme_ects = c("180", "180", "210", "180", "180")
                 )

Which makes this (all values are strings):

#> print(df):
  coursecode   coursename      courseurl    programme_faculty   programme_name   programme_ects
1    WBPH001    Mechanics      url1         FSE                 Mat              180
2    WBPH001    Mechanics      url1         FSE                 Bio              180
3    WBPH001    Mechanics      url1         FSE                 Ast              210
4    WBPH058    Calculus 2     url2         FSE                 Ast              180
5    WBAS007    Introduction   url3         FSE                 Ast              180

I have exported all the courses of a whole faculty, but some courses are listed in multiple programs (in this example for example "Mechanics" which is associated with the "Mat", "Bio" and "Ast" program.

In short, what I would like to achieve is that all these duplicate courses are removed, while preserving the program information (i.e. name, ects, faculty).

So if there is a duplicate identified in column "coursecode", "coursename", and "courseurl" it will automatically collapse program information (columns "programme_faculty", "programme name", and "programme_ects") in separate lists in each column

The dataset should then look like this:

#> print(modified_df):
     coursecode coursename     courseurl    programme_faculty        programme_name      programme_ects
1    WBPH001    Mechanics      url1        c(FSE, FSE, FSE)          c(Mat, Bio, Ast)    c(180, 180, 210)
2    WBPH058    Calculus 2     url2        FSE                       Ast                 180
3    WBAS007    Introduction   url3        FSE                       Ast                 180

Primarily the course information is used in downstream analysis, but it is important that it is always possible to retrieve the programs that a course is associated with. Therefore I need such a dataframe, but I cannot seem to find out which functions have to be used to achieve this.

It is important that the strings are not simply collapsed together and separated with something like " | ".

I have tried functions like aggregate(), collapse(), and other suggestions from other stackoverflow queries but their solutions did not work for my specific dataset.


Solution

  • You could group_by on the columns and summarise these groups across the columns you want to merge by collapsing with paste like this:

    library(dplyr)
    df %>%
      group_by(coursecode, coursename, courseurl) %>%
      summarise(across(programme_faculty:programme_ects, ~ paste(.x, collapse = ", ")))
    #> # A tibble: 3 × 6
    #> # Groups:   coursecode, coursename [3]
    #>   coursecode coursename   courseurl programme_faculty programme_name programme…¹
    #>   <chr>      <chr>        <chr>     <chr>             <chr>          <chr>      
    #> 1 WBAS007    Introduction url3      FSE               Ast            180        
    #> 2 WBPH001    Mechanics    url1      FSE, FSE, FSE     Mat, Bio, Ast  180, 180, …
    #> 3 WBPH058    Calculus 2   url2      FSE               Ast            180        
    #> # … with abbreviated variable name ¹​programme_ects
    

    You could also list them like this:

    library(dplyr)
    df %>%
      group_by(coursecode, coursename, courseurl) %>%
      summarise(across(programme_faculty:programme_ects, ~ list(.x)))
    #> # A tibble: 3 × 6
    #> # Groups:   coursecode, coursename [3]
    #>   coursecode coursename   courseurl programme_faculty programme_name programme…¹
    #>   <chr>      <chr>        <chr>     <list>            <list>         <list>     
    #> 1 WBAS007    Introduction url3      <chr [1]>         <chr [1]>      <chr [1]>  
    #> 2 WBPH001    Mechanics    url1      <chr [3]>         <chr [3]>      <chr [3]>  
    #> 3 WBPH058    Calculus 2   url2      <chr [1]>         <chr [1]>      <chr [1]>  
    #> # … with abbreviated variable name ¹​programme_ects
    

    Created on 2023-03-25 with reprex v2.0.2

    As said by @zephryl you could replace ~list(.x) with just list.