rdatabasereformatting

Reformate data based on pattern in R


I hope you can help me with this problem, I have the following data like this:

ID,colour
1,base_yellow
1,blue
1,base_red
1,blue
1,pink
1,blue
1,base_yellow
2,base_yellow
2,blue
2,base_red
2,blue
2,pink
2,blue
2,base_yellow
3,base_yellow
3,blue
3,pink
3,blue
3,base_yellow
4,base_yellow
4,blue
4,green
4,blue
4,green
4,blue
4,pink
4,blue
4,base_yellow

Every time meet with base (base_yellow, base_red), it creates new group, the output that is expected as shown below, which gives a new variable:

ID,colour
1,base_yellow; blue; base_red
1,base_red; blue; pink;blue;base_yellow
2,base_yellow; blue; base_red
2,base_red; blue; pink;blue; base_yellow
3,base_yellow;blue;pinkblue;base_yellow
4,base_yellow; blue;green;blue;green;blue;pink;blue;base_yellow

Solution

  • This is something you might be able to adapt for your needs.

    First, create a vector vec that includes row positions where colour starts with "base".

    Then, you can use map2_dfr from purrr that will provide colour that ranges from start to end positions based on vec. This will help with situations where the same colour is used in more than one row in the end. A grouping variable group is also created in this step.

    After grouping by group, you can keep only colour groups that have more than one colour and str_c to collapse them together for the same group.

    library(tidyverse)
    
    vec <- which(grepl("^base", df$colour))
    
    map2_dfr(
      vec[-length(vec)],
      vec[-1],
      ~df[.x:.y, ],
      .id = "group"
    ) %>%
      group_by(group) %>%
      filter(n_distinct(colour) > 1) %>%
      summarise(ID = first(ID), colour = str_c(colour, collapse = "; ")) %>%
      select(-group)
    

    Output

         ID colour                                                              
      <int> <chr>                                                               
    1     1 base_yellow; blue; base_red                                         
    2     1 base_red; blue; pink; blue; base_yellow                             
    3     2 base_yellow; blue; base_red                                         
    4     2 base_red; blue; pink; blue; base_yellow                             
    5     3 base_yellow; blue; pink; blue; base_yellow                          
    6     4 base_yellow; blue; green; blue; green; blue; pink; blue; base_yellow