rdataframedplyrtidyversedata-manipulation

Dplyr: Rearrange a Dataframe by Putting Multiple Factors in One Cell in Multiple Columns into Separate Individual Rows into one Column using R


Overview

I have a data frame with 444 rows and 361 columns. In 3 of the columns named 'Collaborations', 'Mini Collaborations', and 'Subcontractors', I've inputted multiple factors showing the names of collaborators, all inputted into one cell separated by commas. Some cells can contain up to 30 names.

Aim:

My ultimate aim is to rearrange the data frame by manipulating the factors (separated by a comma) in every single cell (see Table 1 - column titles'Collaborations', 'Mini Collaborations', and 'Subcontractors'), and place all the information from all three columns them into individual separate rows (see Table 2) in one single column called 'Collaborations'.

Problem:

In some single cells, there are no names, just the value zero, and, I don't want to output multiple rows within the new column 'Collaborations' with zeros.

In the future, I want to be able to filter the new data frame that's been outputted by each collaborator's name to track the dates they've worked and their work history.

I apologise for not providing a sample code, but I'm not an expert with R. I can use dplyr and data.table, but I've tried hard to find a way and I've been unsuccessful.

This question Split comma-separated strings in a column into separate rows was flagged up to me, and it was very helpful in rearranging one column using this code:

Df %>% mutate(Collaborations = strsplit(as.character(Collaborations), ",")) %>% unnest(Collaborations)

However, I'm not sure how to rearrange three columns this way, put them into one column, and, delete the columns that contain zeros so they are not duplicated into the new column 'Collaborators'.

I've provided a sample data frame below. If anyone can help, I'd be appreciative.

Table 1

enter image description here

Table 2

enter image description here

Dataframe

 structure(list(Date = structure(c(1L, 3L, 4L, 5L, 6L, 2L), levels = c("1/1/23", 
"10/1/23", "2/1/23", "4/1/23", "6/1/23", "8/1/23"), class = "factor"), 
    Start_Time = structure(c(6L, 4L, 3L, 5L, 1L, 2L), levels = c("4:30", 
    "5:13", "7:30", "8:25", "8:45", "9:10"), class = "factor"), 
    End_Time = structure(c(2L, 1L, 4L, 5L, 6L, 3L), levels = c("12:30", 
    "14:15", "14:20", "15:25", "16:17", "18:17"), class = "factor"), 
    Project = structure(c(6L, 1L, 2L, 5L, 3L, 4L), levels = c("Business_Ltd", 
    "Clyde_Ltd", "Fabian_Ltd", "Friends_Ltd", "Greenhouse_Ltd", 
    "Red_Ltd"), class = "factor"), Collaborations = structure(c(2L, 
    6L, 1L, 4L, 5L, 3L), levels = c("FRANK, TESSA", "JON, SAM, LEAH", 
    "LEO, TIA", "LOUISE, TINA", "PATTY, JOAN", "PIERRE, LESLEY"
    ), class = "factor"), Mini_Collborations = structure(c(3L, 
    5L, 1L, 1L, 4L, 2L), levels = c("0", "GEORGE", "MEL", "TILLY", 
    "TONY, JIM"), class = "factor"), Subcontractors = structure(c(5L, 
    2L, 1L, 1L, 4L, 3L), levels = c("0", "CHAN, BEN", "KIM, UNA", 
    "KIRK, TRACY, KELLY", "SKYLER, TEESCHA"), class = "factor")), class = "data.frame", row.names = c(NA, 
-6L))

Solution

  • This might be a solution using dplyr and tidyr to transform your data into long format and remove the "0" entries:

    df |> 
        pivot_longer(cols = c(Collaborations, Mini_Collborations, Subcontractors),
                     names_to = "Role",
                     values_to = "Collaborations") |> 
        separate_rows(Collaborations, sep = ", ") |> 
        select(-Role) |> 
        filter(Collaborations != "0")
    

    Note that I'm using the native pipe operator |>, but %>% will work equally.