rdplyrtidyrna

Fill gaps in data frame beginning with the first non-NA observation and ending with the last non-NA observation by group


My data frame looks like this:

df <- structure(list(country = c("Slovenia", "Slovenia", "Slovenia", "Slovenia", "Slovenia", "Slovenia", "Hungary", "Hungary", "Hungary", "Hungary", "Hungary", "Hungary"), year = c(2000, 2001, 
2002, 2003, 2004, 2005, 2000, 2001, 2002, 2003, 2004, 2005), gov_id = c(NA, 1, NA, NA, 2, NA, NA, 12, NA, NA, 13, NA)),
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

I want to fill the gaps in the data based on the last non-NA observation per group (country) until the last non-NA observation per group. Accordingly, the value of gov_id in the rows "1", "6", "7" and "12" should remain "NA". Consequently, the data frame should look like this in the end:

df_new <- structure(list(country = c("Slovenia", "Slovenia", "Slovenia", "Slovenia", "Slovenia", "Slovenia", "Hungary", "Hungary", "Hungary", "Hungary", "Hungary", "Hungary"), year = c(2000, 2001, 
2002, 2003, 2004, 2005, 2000, 2001, 2002, 2003, 2004, 2005), gov_id = c(NA, 1, 1, 1, 2, NA, NA, 12, 12, 12, 13, NA)),
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

I already tried

library(tidyr)

fill(gov_id, direction = "up")  

and similar options, but the problem is that it overwrites NA's at the beginning and/or the end of a group, that should remain NA's.

I am thankful for any suggestions that also work on a larger scale!


Solution

  • I think you'll need to "group" your dataframe by country before doing fill (and note the direction should be "down" instead of "up", as demonstrated in your df_new). A quick fix to your fill approach is to turn the last block of NA back to NA.

    First create reproducible example, the last two rows of "Hungary" are NA.

    df <- structure(list(country = c("Slovenia", "Slovenia", "Slovenia", 
    "Slovenia", "Slovenia", "Slovenia", "Hungary", "Hungary", "Hungary", 
    "Hungary", "Hungary", "Hungary", "Hungary"), year = c(2000, 2001, 
    2002, 2003, 2004, 2005, 2000, 2001, 2002, 2003, 2004, 2005, 2006
    ), gov_id = c(NA, 1, NA, NA, 2, NA, NA, 12, NA, NA, 13, NA, NA
    )), class = "data.frame", row.names = c("1", "2", "3", "4", "5", 
    "6", "7", "8", "9", "10", "11", "12", "13"))
    

    Code:

    library(tidyverse)
    
    df %>% 
      group_by(country) %>% 
      mutate(flag = cumsum(!is.na(gov_id)),
             gov_id2 = gov_id) %>% 
      fill(gov_id, .direction = "down") %>% 
      mutate(gov_id = ifelse(flag == max(flag) & is.na(gov_id2), NA, gov_id)) %>% 
      select(-flag, -gov_id2) %>% 
      ungroup()
    
    # A tibble: 13 × 3
       country   year gov_id
       <chr>    <dbl>  <dbl>
     1 Slovenia  2000     NA
     2 Slovenia  2001      1
     3 Slovenia  2002      1
     4 Slovenia  2003      1
     5 Slovenia  2004      2
     6 Slovenia  2005     NA
     7 Hungary   2000     NA
     8 Hungary   2001     12
     9 Hungary   2002     12
    10 Hungary   2003     12
    11 Hungary   2004     13
    12 Hungary   2005     NA
    13 Hungary   2006     NA