rdplyrwide-column-store

Wide data frame with 4 columns to long data frame with 3 columns


I have a data frame (sample below), as follows:

df = structure(list(Stage1yBefore = c("3.1", "1", "4", "2", "NA"), 
Stage2yBefore = c("NA", "2", "3.2", "2", "NA"), ClinicalActivity1yBefore = 
c(TRUE, 
TRUE, TRUE, TRUE, FALSE), ClinicalActivity2yBefore = c(FALSE, 
TRUE, TRUE, TRUE, FALSE)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L))

I would like to convert it to a long format using dplyr, but for some reason get an error.

The output should look like (converting the first row of df):

Output = data_frame(TimeFrame = c("1y", "2y"), Stage = c(3, NA), Clinical = 
c(T, F))

So that each row of df becomes 2 rows in the output.

What I tried doesnt work (and I'm actually not sure exactly how to do this):

Output = gather(df, TimeFrame, Stage, Clinical, Stage1yBefore:ClinicalActivity2yBefore)

I get:

Error in .f(.x[[i]],...): Object 'Clinical' not found.

Any ideas?


Solution

  • library(dplyr)
    library(stringr)
    library(tidyr)
    df %>% rownames_to_column() %>% 
           gather(TimeFrame, Stage, Stage1yBefore:ClinicalActivity2yBefore) %>% 
           #From TimeFrame extract a digit followed by y, also Stage or Clinical 
           mutate(Time=str_extract(TimeFrame,'\\dy'), Key=str_extract(TimeFrame,'Stage|Clinical')) %>% 
           dplyr::select(-TimeFrame) %>% 
           spread(Key,Stage)
    
    # A tibble: 10 x 4
      rowname Time  Clinical Stage
      <chr>   <chr> <chr>    <chr>
      1 1       1y    TRUE     3.1  
      2 1       2y    FALSE    NA   
      3 2       1y    TRUE     1    
      4 2       2y    TRUE     2    
      5 3       1y    TRUE     4    
      6 3       2y    TRUE     3.2  
      7 4       1y    TRUE     2    
      8 4       2y    TRUE     2    
      9 5       1y    FALSE    NA   
     10 5       2y    FALSE    NA