rdplyrdata-wrangling

How to regroup and mutate a dataframe according to specific conditions?


I have a dataframe that looks like this:

example <- data.frame(
  date = as.Date(c('2023-02-12', '2023-02-13', '2023-02-14',
           '2023-02-12', '2023-02-13', '2023-02-14',
           '2023-03-06', '2023-03-07', '2023-03-08',
           '2023-03-07', '2023-03-08', '2023-03-09')),
  PID = c(1091, 1091, 1091,
          2091, 2091, 2091,
          1137, 1137, 1137, 
          2137, 2137, 2137),
  give = c('a', 'c', 'e',
           'aa', 'cc', 'ee',
           'aaa', 'ccc', 'eee',
           'aaaa', 'cccc', 'eeee'),
  receive = c('b', 'd', 'f',
              'bb', 'dd', 'ff',
              'bbb', 'ddd', 'fff',
              'bbbb', 'dddd', 'ffff')
)

I want to regroup and relabel the columns in the dataframe such that this is the output:

  example_solution <- data.frame(

  date = as.Date(c('2023-02-12', '2023-02-13', '2023-02-14',
                   '2023-03-06', '2023-03-07', '2023-03-08', '2023-03-09')),
  CID = c(91, 91, 91, 
          137, 137, 137, 137),
  PID_A = c(1091, 1091, 1091,
          1137, 1137, 1137, NA),
  PID_B = c(2091, 2091, 2091,
            NA, 2137, 2137, 2137),
  give_A = c('a', 'c', 'e',
             'aaa', 'ccc', 'eee', NA),
  receive_B = c('bb', 'dd', 'ff',
                NA, 'bbbb', 'dddd', 'ffff'),
  give_B = c('aa', 'cc', 'ee',
           NA, 'aaaa', 'cccc', 'eeee'),
  receive_A = c('b', 'd', 'f',
              'bbb', 'ddd', 'fff', NA)
)

I have some functional code so far that does most of the job:

example <- example %>% 
  group_by(CID = sub('.', '', PID), date) %>% 
  mutate(name = LETTERS[1:n()]) %>% 
  pivot_wider(values_from = PID:receive) %>%
  relocate(receive_A, .after = give_B) %>%
  relocate(receive_B, .after = give_A)

The only problem is that when there is a date where there is only an entry for the CID with the PID starting with 2 and not 1 (i.e. 2137 on date 2023-03-09 in this case), 2137 is grouped as PID_A instead of PID_B. ffff is also grouped under give_A instead of receive_B and eeee is grouped under receive_A instead of give_B.

Essentially, data for all PIDs that start with 2 should always be associated the column names whatevername_B, even if there is no corresponding PID that start with 1 on the same date.

Does anyone know how I can get to the desired output?


Solution

  • The value of the column name is defined using n(), which evaluates to 1 in the case of the date 2023-03-09 since there is only one row inside of example corresponding to this date. Hence, the suffix _A gets assigned.

    If the requirement is "data for all PIDs that start with 2 should always be associated the column names whatevername_B" then we have to take this into account when defining name. This can be achieved e.g. by replacing mutate(name = LETTERS[1:n()]):

    example %>%
        group_by(CID = sub('.', '', PID), date) %>%
        mutate(name = case_when(substr(PID, 1, 1) == "2" ~ "B",
                                TRUE ~ LETTERS[1:n()])) %>%
        pivot_wider(values_from = PID:receive) %>%
        relocate(receive_A, .after = give_B) %>%
        relocate(receive_B, .after = give_A)
    

    Then the output will look like:

    # A tibble: 7 × 8
    # Groups:   CID, date [7]
      date       CID   PID_A PID_B give_A receive_B give_B receive_A
      <date>     <chr> <dbl> <dbl> <chr>  <chr>     <chr>  <chr>    
    1 2023-02-12 091    1091  2091 a      bb        aa     b        
    2 2023-02-13 091    1091  2091 c      dd        cc     d        
    3 2023-02-14 091    1091  2091 e      ff        ee     f        
    4 2023-03-06 137    1137    NA aaa    NA        NA     bbb      
    5 2023-03-07 137    1137  2137 ccc    bbbb      aaaa   ddd      
    6 2023-03-08 137    1137  2137 eee    dddd      cccc   fff      
    7 2023-03-09 137      NA  2137 NA     ffff      eeee   NA