I have a large dataset that is very similar to the following dummy dataset:
df = data.frame(coursecode = c("WBPH001","WBPH001","WBPH001","WBPH058","WBAS007"),
coursename = c("Mechanics","Mechanics","Mechanics", "Calculus 2","Introduction"),
courseurl = c("url1","url1","url1","url2","url3"),
programme_faculty = c("FSE","FSE","FSE", "FSE", "FSE"),
programme_name = c( "Mat","Bio","Ast","Ast","Ast"),
programme_ects = c("180", "180", "210", "180", "180")
)
Which makes this (all values are strings):
#> print(df):
coursecode coursename courseurl programme_faculty programme_name programme_ects
1 WBPH001 Mechanics url1 FSE Mat 180
2 WBPH001 Mechanics url1 FSE Bio 180
3 WBPH001 Mechanics url1 FSE Ast 210
4 WBPH058 Calculus 2 url2 FSE Ast 180
5 WBAS007 Introduction url3 FSE Ast 180
I have exported all the courses of a whole faculty, but some courses are listed in multiple programs (in this example for example "Mechanics" which is associated with the "Mat", "Bio" and "Ast" program.
In short, what I would like to achieve is that all these duplicate courses are removed, while preserving the program information (i.e. name, ects, faculty).
So if there is a duplicate identified in column "coursecode", "coursename", and "courseurl" it will automatically collapse program information (columns "programme_faculty", "programme name", and "programme_ects") in separate lists in each column
The dataset should then look like this:
#> print(modified_df):
coursecode coursename courseurl programme_faculty programme_name programme_ects
1 WBPH001 Mechanics url1 c(FSE, FSE, FSE) c(Mat, Bio, Ast) c(180, 180, 210)
2 WBPH058 Calculus 2 url2 FSE Ast 180
3 WBAS007 Introduction url3 FSE Ast 180
Primarily the course information is used in downstream analysis, but it is important that it is always possible to retrieve the programs that a course is associated with. Therefore I need such a dataframe, but I cannot seem to find out which functions have to be used to achieve this.
It is important that the strings are not simply collapsed together and separated with something like " | ".
I have tried functions like aggregate(), collapse(), and other suggestions from other stackoverflow queries but their solutions did not work for my specific dataset.
You could group_by
on the columns and summarise
these groups across
the columns you want to merge by collapsing with paste
like this:
library(dplyr)
df %>%
group_by(coursecode, coursename, courseurl) %>%
summarise(across(programme_faculty:programme_ects, ~ paste(.x, collapse = ", ")))
#> # A tibble: 3 × 6
#> # Groups: coursecode, coursename [3]
#> coursecode coursename courseurl programme_faculty programme_name programme…¹
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 WBAS007 Introduction url3 FSE Ast 180
#> 2 WBPH001 Mechanics url1 FSE, FSE, FSE Mat, Bio, Ast 180, 180, …
#> 3 WBPH058 Calculus 2 url2 FSE Ast 180
#> # … with abbreviated variable name ¹programme_ects
You could also list
them like this:
library(dplyr)
df %>%
group_by(coursecode, coursename, courseurl) %>%
summarise(across(programme_faculty:programme_ects, ~ list(.x)))
#> # A tibble: 3 × 6
#> # Groups: coursecode, coursename [3]
#> coursecode coursename courseurl programme_faculty programme_name programme…¹
#> <chr> <chr> <chr> <list> <list> <list>
#> 1 WBAS007 Introduction url3 <chr [1]> <chr [1]> <chr [1]>
#> 2 WBPH001 Mechanics url1 <chr [3]> <chr [3]> <chr [3]>
#> 3 WBPH058 Calculus 2 url2 <chr [1]> <chr [1]> <chr [1]>
#> # … with abbreviated variable name ¹programme_ects
Created on 2023-03-25 with reprex v2.0.2
As said by @zephryl you could replace ~list(.x)
with just list
.