I have a dataframe of county executives and the year they were inaugurated. I am running a panel study with county-year as the unit of analysis. The date range is 2000 to 2004.
I would like to expand the df so that it lists who the county executive was in each year from 2000 to 2004, but some of the districts were created partway through my period of analysis.
My starting point is this:
df <- data.frame(year= c(2000, 2001, 2003, 2000, 2002, 2004, 2003),
executive.name= c("Johnson", "Smith", "Alleghany", "Roberts", "Clarke", "Tollson", "Roland"),
party= c("PartyRed", "PartyYellow", "PartyGreen", "PartyYellow", "PartyOrange", "PartyRed", "PartyPurple"),
district= c(1001, 1001, 1001, 1002, 1002, 1002, 1003))
year executive.name party district
1 2000 Johnson PartyRed 1001
2 2001 Smith PartyYellow 1001
3 2003 Alleghany PartyGreen 1001
4 2000 Roberts PartyYellow 1002
5 2002 Clarke PartyOrange 1002
6 2004 Tollson PartyRed 1002
7 2003 Roland PartyPurple 1003
I want the expanded df to include only the years in which each district existed, like this:
df.neat <- data.frame(year= c(2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004, 2003, 2004),
executive.name= c("Johnson", "Smith", "Smith", "Alleghany", "Alleghany", "Roberts", "Roberts", "Clarke", "Clarke", "Tollson", "Roland", "Roland"),
party= c("PartyRed", "PartyYellow", "PartyYellow", "PartyGreen", "PartyGreen", "PartyYellow", "PartyYellow", "PartyOrange", "PartyOrange", "PartyRed", "PartyPurple", "PartyPurple"),
district= c(1001, 1001, 1001, 1001, 1001, 1002, 1002, 1002, 1002, 1002, 1003, 1003))
> df.neat
year executive.name party district
1 2000 Johnson PartyRed 1001
2 2001 Smith PartyYellow 1001
3 2002 Smith PartyYellow 1001
4 2003 Alleghany PartyGreen 1001
5 2004 Alleghany PartyGreen 1001
6 2000 Roberts PartyYellow 1002
7 2001 Roberts PartyYellow 1002
8 2002 Clarke PartyOrange 1002
9 2003 Clarke PartyOrange 1002
10 2004 Tollson PartyRed 1002
11 2003 Roland PartyPurple 1003
12 2004 Roland PartyPurple 1003
Note that district 1003 was established in 2003. If I run complete(), it treats 2000, 2001, and 2002 as implicit missing values for that district and adds rows for them. fill() then carries the last values from district 1002 down into those new rows.
In other words, my df ends up looking like this:
df |>
tidyr::complete(district, year) |>
tidyr::fill(executive.name, party)
# A tibble: 15 × 4
district year executive.name party
<dbl> <dbl> <chr> <chr>
1 1001 2000 Johnson PartyRed
2 1001 2001 Smith PartyYellow
3 1001 2002 Smith PartyYellow
4 1001 2003 Alleghany PartyGreen
5 1001 2004 Alleghany PartyGreen
6 1002 2000 Roberts PartyYellow
7 1002 2001 Roberts PartyYellow
8 1002 2002 Clarke PartyOrange
9 1002 2003 Clarke PartyOrange
10 1002 2004 Tollson PartyRed
11 1003 2000 Tollson PartyRed
12 1003 2001 Tollson PartyRed
13 1003 2002 Tollson PartyRed
14 1003 2003 Roland PartyPurple
15 1003 2004 Roland PartyPurple
We can use statar::fill_gap():
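library(dplyr)
df %>%
group_by(district) %>%
# full = TRUE pads every district to the overall 2000-2004 range
statar::fill_gap(year, full = TRUE) %>%
# carry each executive forward within their own district
tidyr::fill(executive.name:party, .direction = "down") %>%
# drop the padded years before a district's first executive (still NA);
# note that na.omit() also drops rows with an NA in any other column
na.omit() %>%
ungroup()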
#> # A tibble: 12 × 4
#> year executive.name party district
#> <dbl> <chr> <chr> <dbl>
#> 1 2000 Johnson PartyRed 1001
#> 2 2001 Smith PartyYellow 1001
#> 3 2002 Smith PartyYellow 1001
#> 4 2003 Alleghany PartyGreen 1001
#> 5 2004 Alleghany PartyGreen 1001
#> 6 2000 Roberts PartyYellow 1002
#> 7 2001 Roberts PartyYellow 1002
#> 8 2002 Clarke PartyOrange 1002
#> 9 2003 Clarke PartyOrange 1002
#> 10 2004 Tollson PartyRed 1002
#> 11 2003 Roland PartyPurple 1003
#> 12 2004 Roland PartyPurple 1003
Created on 2024-07-17 with reprex v2.0.2
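If you would rather not add statar as a dependency, the same idea can probably be done with dplyr and tidyr alone. This is a minimal sketch, assuming a reasonably recent tidyr in which complete() operates within each group of a grouped data frame. The names panel_end and panel are my own, not from the question. Each district is completed from its own first year up to the end of the panel, so the pre-2003 rows for district 1003 are never created and no na.omit() is needed.

```r
library(dplyr)
library(tidyr)

# Same starting data as in the question
df <- data.frame(
  year = c(2000, 2001, 2003, 2000, 2002, 2004, 2003),
  executive.name = c("Johnson", "Smith", "Alleghany", "Roberts",
                     "Clarke", "Tollson", "Roland"),
  party = c("PartyRed", "PartyYellow", "PartyGreen", "PartyYellow",
            "PartyOrange", "PartyRed", "PartyPurple"),
  district = c(1001, 1001, 1001, 1002, 1002, 1002, 1003)
)

panel_end <- 2004  # assumed last year of the panel (hypothetical name)

panel <- df %>%
  group_by(district) %>%
  # within each district, add rows from its first observed year to panel_end
  complete(year = seq(min(year), panel_end, by = 1)) %>%
  # still grouped, so values are only carried forward within a district
  fill(executive.name, party, .direction = "down") %>%
  ungroup()

panel
```

Because the grouping is still in place when fill() runs, Tollson is not carried from district 1002 into district 1003. This assumes a district's first row in df marks the year it was created, which is how the example data is set up.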