Fill gaps in data frame beginning with the first non-NA observation and ending with the last non-NA observation by group

My data frame looks like this:

df <- structure(list(country = c("Slovenia", "Slovenia", "Slovenia", "Slovenia", "Slovenia", "Slovenia", "Hungary", "Hungary", "Hungary", "Hungary", "Hungary", "Hungary"), year = c(2000, 2001, 
2002, 2003, 2004, 2005, 2000, 2001, 2002, 2003, 2004, 2005), gov_id = c(NA, 1, NA, NA, 2, NA, NA, 12, NA, NA, 13, NA)),
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

I want to fill the gaps in the data based on the last non-NA observation per group (country) until the last non-NA observation per group. Accordingly, the value of gov_id in the rows "1", "6", "7" and "12" should remain "NA". Consequently, the data frame should look like this in the end:

df_new <- structure(list(country = c("Slovenia", "Slovenia", "Slovenia", "Slovenia", "Slovenia", "Slovenia", "Hungary", "Hungary", "Hungary", "Hungary", "Hungary", "Hungary"), year = c(2000, 2001, 
2002, 2003, 2004, 2005, 2000, 2001, 2002, 2003, 2004, 2005), gov_id = c(NA, 1, 1, 1, 2, NA, NA, 12, 12, 12, 13, NA)),
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

I already tried

library(tidyr)

fill(gov_id, direction = "up")

and similar options, but the problem is that it overwrites NA's at the beginning and/or the end of a group, that should remain NA's.

I am thankful for any suggestions that also work on a larger scale!

Solution

I think you'll need to "group" your dataframe by country before doing fill (and note the direction should be "down" instead of "up", as demonstrated in your df_new). A quick fix to your fill approach is to turn the last block of NA back to NA.

First create reproducible example, the last two rows of "Hungary" are NA.

df <- structure(list(country = c("Slovenia", "Slovenia", "Slovenia", 
"Slovenia", "Slovenia", "Slovenia", "Hungary", "Hungary", "Hungary", 
"Hungary", "Hungary", "Hungary", "Hungary"), year = c(2000, 2001, 
2002, 2003, 2004, 2005, 2000, 2001, 2002, 2003, 2004, 2005, 2006
), gov_id = c(NA, 1, NA, NA, 2, NA, NA, 12, NA, NA, 13, NA, NA
)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", 
"6", "7", "8", "9", "10", "11", "12", "13"))

Code:

library(tidyverse)

df %>% 
  group_by(country) %>% 
  mutate(flag = cumsum(!is.na(gov_id)),
         gov_id2 = gov_id) %>% 
  fill(gov_id, .direction = "down") %>% 
  mutate(gov_id = ifelse(flag == max(flag) & is.na(gov_id2), NA, gov_id)) %>% 
  select(-flag, -gov_id2) %>% 
  ungroup()

# A tibble: 13 × 3
   country   year gov_id
   <chr>    <dbl>  <dbl>
 1 Slovenia  2000     NA
 2 Slovenia  2001      1
 3 Slovenia  2002      1
 4 Slovenia  2003      1
 5 Slovenia  2004      2
 6 Slovenia  2005     NA
 7 Hungary   2000     NA
 8 Hungary   2001     12
 9 Hungary   2002     12
10 Hungary   2003     12
11 Hungary   2004     13
12 Hungary   2005     NA
13 Hungary   2006     NA