My data frame looks like this:
df <- structure(list(country = c("Slovenia", "Slovenia", "Slovenia", "Slovenia", "Slovenia", "Slovenia", "Hungary", "Hungary", "Hungary", "Hungary", "Hungary", "Hungary"), year = c(2000, 2001,
2002, 2003, 2004, 2005, 2000, 2001, 2002, 2003, 2004, 2005), gov_id = c(NA, 1, NA, NA, 2, NA, NA, 12, NA, NA, 13, NA)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
I want to fill the gaps in the data based on the last non-NA observation per group (country) until the last non-NA observation per group. Accordingly, the value of gov_id in the rows "1", "6", "7" and "12" should remain "NA". Consequently, the data frame should look like this in the end:
df_new <- structure(list(country = c("Slovenia", "Slovenia", "Slovenia", "Slovenia", "Slovenia", "Slovenia", "Hungary", "Hungary", "Hungary", "Hungary", "Hungary", "Hungary"), year = c(2000, 2001,
2002, 2003, 2004, 2005, 2000, 2001, 2002, 2003, 2004, 2005), gov_id = c(NA, 1, 1, 1, 2, NA, NA, 12, 12, 12, 13, NA)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
I already tried
library(tidyr)
fill(gov_id, direction = "up")
and similar options, but the problem is that it overwrites NA's at the beginning and/or the end of a group, that should remain NA's.
I am thankful for any suggestions that also work on a larger scale!
I think you'll need to "group" your dataframe by country
before doing fill
(and note the direction should be "down" instead of "up", as demonstrated in your df_new
). A quick fix to your fill
approach is to turn the last block of NA
back to NA
.
First create reproducible example, the last two rows of "Hungary" are NA
.
df <- structure(list(country = c("Slovenia", "Slovenia", "Slovenia",
"Slovenia", "Slovenia", "Slovenia", "Hungary", "Hungary", "Hungary",
"Hungary", "Hungary", "Hungary", "Hungary"), year = c(2000, 2001,
2002, 2003, 2004, 2005, 2000, 2001, 2002, 2003, 2004, 2005, 2006
), gov_id = c(NA, 1, NA, NA, 2, NA, NA, 12, NA, NA, 13, NA, NA
)), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10", "11", "12", "13"))
Code:
library(tidyverse)
df %>%
group_by(country) %>%
mutate(flag = cumsum(!is.na(gov_id)),
gov_id2 = gov_id) %>%
fill(gov_id, .direction = "down") %>%
mutate(gov_id = ifelse(flag == max(flag) & is.na(gov_id2), NA, gov_id)) %>%
select(-flag, -gov_id2) %>%
ungroup()
# A tibble: 13 × 3
country year gov_id
<chr> <dbl> <dbl>
1 Slovenia 2000 NA
2 Slovenia 2001 1
3 Slovenia 2002 1
4 Slovenia 2003 1
5 Slovenia 2004 2
6 Slovenia 2005 NA
7 Hungary 2000 NA
8 Hungary 2001 12
9 Hungary 2002 12
10 Hungary 2003 12
11 Hungary 2004 13
12 Hungary 2005 NA
13 Hungary 2006 NA