r tidyr

complete data frame with same number of rows per group


I’m struggling with an issue with the complete() function that I can’t resolve. I got the desired behaviour with another data frame, but I can’t make it work the same way for this one, even though the structure of the data is similar. Here is the data frame:

df2<-  structure(list(Season = c("2023/2024", "2023/2024", "2023/2024", 
"2023/2024", "2023/2024", "2023/2024", "2023/2024", "2023/2024", 
"2023/2024", "2023/2024", "2023/2024", "2023/2024", "2023/2024", 
"2023/2024", "2023/2024", "2023/2024", "2023/2024", "2023/2024", 
"2023/2024", "2023/2024", "2023/2024", "2023/2024"), DATEMM = structure(c(2023.58333333333, 
2023.58333333333, 2023.66666666667, 2023.66666666667, 2023.75, 
2023.75, 2023.83333333333, 2023.83333333333, 2023.91666666667, 
2023.91666666667, 2024, 2024, 2024.08333333333, 2024.08333333333, 
2024.16666666667, 2024.16666666667, 2024.25, 2024.25, 2024.33333333333, 
2024.33333333333, 2024.41666666667, 2024.41666666667), class = "yearmon"), 
    team = c("Brest", "Clermont", "Brest", "Clermont", "Brest", 
    "Clermont", "Brest", "Clermont", "Brest", "Clermont", "Brest", 
    "Clermont", "Brest", "Clermont", "Brest", "Clermont", "Brest", 
    "Clermont", "Brest", "Clermont", "Brest", "Clermont"), goal = c(0, 
    0, 5, 2, 8, 5, 11, 7, 14, 9, 25, 11, 29, 14, 34, 16, 37, 
    19, 49, 25, 53, 26), rank = c(7L, 9L, 7L, 17L, 12L, 17L, 
    12L, 17L, 12L, 17L, 4L, 18L, 3L, 18L, 4L, 18L, 6L, 18L, 3L, 
    18L, 3L, 18L), image_file = c("~/teamlg/Brest.png", "~/teamlg/Clermont.png", 
    "~/teamlg/Brest.png", "~/teamlg/Clermont.png", "~/teamlg/Brest.png", 
    "~/teamlg/Clermont.png", "~/teamlg/Brest.png", "~/teamlg/Clermont.png", 
    "~/teamlg/Brest.png", "~/teamlg/Clermont.png", "~/teamlg/Brest.png", 
    "~/teamlg/Clermont.png", "~/teamlg/Brest.png", "~/teamlg/Clermont.png", 
    "~/teamlg/Brest.png", "~/teamlg/Clermont.png", "~/teamlg/Brest.png", 
    "~/teamlg/Clermont.png", "~/teamlg/Brest.png", "~/teamlg/Clermont.png", 
    "~/teamlg/Brest.png", "~/teamlg/Clermont.png")), row.names = c(NA, 
-22L), class = c("tbl_df", "tbl", "data.frame"))

I want to complete DATEMM with intermediate points. Since it is of class yearmon, I first convert it to numeric values and then complete it with the sequence:

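library(dplyr)
library(tidyr)

fp_interp <- 3
df3 <- df2 %>% 
  # snap each yearmon value to a multiple of 1/12 and drop the yearmon class
  mutate(DATEMM_num = as.numeric(floor(12*DATEMM + .0001) / 12)) %>%
  complete(DATEMM_num = seq(min(DATEMM_num), max(DATEMM_num), (1/12)/fp_interp),
         nesting(team,image_file,Season))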

I need every DATEMM_num value to have the same number of rows. However, when I run table(df3$DATEMM_num), the counts differ: some values have 2 rows while others have 4.

Thank you for your help.


Solution

  • This looks like a floating-point rounding issue. Here is a way to check:

    # CREATE DF2 LIKE YOU DID ALREADY
    
    fp_interp = 3
    
    # create df3 without completing the rows for now
    df3 <- df2 %>% 
      mutate(DATEMM_num = as.numeric(floor(12*DATEMM + .0001) / 12))
    
    # this is a vector with the intermediate points
    # (DATEMM_num is a column of df3, so it has to be accessed with df3$)
    date_with_intermediate_points <- seq(min(df3$DATEMM_num), max(df3$DATEMM_num), (1/12)/fp_interp)
    
    # create a df that keeps only df3$DATEMM_num that are absent from the array with intermediate points
    df3_check <- df3 %>% filter(!DATEMM_num %in% date_with_intermediate_points)
    

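    You will see that some DATEMM_num values are not found among the calculated intermediate points, although they look the same when printed, e.g. 2023.667, 2023.917. seq() builds its values by adding multiples of the step to the start value, so they can differ from the stored values in the last few bits. complete() matches values exactly, so it treats them as different dates and adds new rows instead of reusing the existing ones, which is presumably why some dates end up with 4 rows (2 original, 2 added) instead of 2.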
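
The mismatch described above can be reproduced without df2. The following is a minimal base R sketch, assuming month values stored as year + month/12; the names `months` and `grid` are made up for illustration and do not come from the question. It compares exact matching with matching after rounding both sides.

```r
# Month values as they would be stored in DATEMM_num (assumed: year + month/12)
months <- 2023 + (7:17) / 12

# Grid of intermediate points built the same way as in the question
fp_interp <- 3
grid <- seq(min(months), max(months), by = (1/12) / fp_interp)

# Exact matching: some months may not be found because of floating-point error
exact_hits <- sum(months %in% grid)

# Matching after rounding both sides to 3 digits: every month is found
rounded_ok <- all(round(months, 3) %in% round(grid, 3))

print(exact_hits)
print(rounded_ok)
```

If `exact_hits` is smaller than `length(months)`, the missing months are exactly the ones that complete() would duplicate.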

    Here is a solution that is a bit more roundabout, but seems to work:

    fp_interp = 3
    # just calculate DATEMM_num, making sure to round it to 3 digits
    df3 <- df2 %>% 
      mutate(DATEMM_num = round(as.numeric(floor(12*DATEMM + .0001) / 12), 3)) 
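    # create the vector of intermediate points from the unrounded dates; starting
    # from the rounded minimum would shift every point (2023.583 + 3/36 rounds to
    # 2023.666, not 2023.667). It is rounded when building df_list_of_dates below.
    date_with_intermediate_points <- seq(min(as.numeric(df2$DATEMM)), max(as.numeric(df2$DATEMM)), (1/12)/fp_interp)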
    # use it to create a data frame with these points and the "team" column
    df_list_of_dates <- data.frame(DATEMM_num = round(date_with_intermediate_points, 3),
                                   Clermont = 'Clermont',
                                   Brest = 'Brest') %>% 
      pivot_longer(cols = c('Clermont', 'Brest'), values_to="team") %>% 
      select(DATEMM_num, team)
    
    # Merge the two data frames
    df4 <- merge(df_list_of_dates, df3, 
                 by = c("DATEMM_num", "team"),
                 all = TRUE)
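
As an alternative to rounding, the grid can be built on an integer step index, so that complete() only ever compares integers. This is a rough sketch on a small stand-in data set, not the original df2: the tibble `toy`, its values, and the column name `step` are assumptions for illustration, and DATEMM is stored here as a plain numeric instead of a yearmon.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for df2: 4 months, 2 teams, DATEMM as year + month/12
toy <- tibble(
  DATEMM = rep(2023 + (7:10) / 12, each = 2),
  team   = rep(c("Brest", "Clermont"), times = 4),
  goal   = c(0, 0, 5, 2, 8, 5, 11, 7)
)

fp_interp <- 3

toy_complete <- toy %>%
  # integer index: number of (1/12)/fp_interp steps since year 0
  mutate(step = as.integer(round(12 * fp_interp * DATEMM))) %>%
  # integers match exactly, so existing rows are reused instead of duplicated
  complete(step = seq(min(step), max(step), by = 1L), nesting(team)) %>%
  # rebuild the numeric date from the integer index
  mutate(DATEMM_num = step / (12 * fp_interp))

# every step should now appear once per team
print(table(toy_complete$step))
```

With df2 itself, the same idea would presumably use as.numeric(DATEMM) inside round() and keep image_file and Season inside nesting(), as in the question.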
    

    I hope this helps!