I have a dataset on county executives and their year of inauguration. I need to break down which year each executive was inaugurated.
The problem is that the notation under the "year" variable is inconsistent.
For instance, let's say I start with this:
df <- data.frame(year = c(2000, "from 2001 to 2002", "01-feb-2003", 2000, "01-jan-2002", "from 2004 to 2005"),
                 executive.name = c("Johnson", "Smith", "Alleghany", "Roberts", "Clarke", "Tollson"),
                 district = rep(c(1001, 1002), each = 3))
I want it to look like this:
df.neat <- data.frame(year = c(2000, 2001, 2003, 2000, 2002, 2004),
                      executive.name = c("Johnson", "Smith", "Alleghany", "Roberts", "Clarke", "Tollson"),
                      district = rep(c(1001, 1002), each = 3))
Note how the inauguration cycle does not always align across districts (2000, 2001, and 2003 for district 1001; 2000, 2002, and 2004 for district 1002).
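library(dplyr)
library(stringr)

# str_extract() returns the first run of four digits in each value,
# which is the inauguration year in all three notations shown above.
df |>
  mutate(year = as.numeric(str_extract(year, "\\d{4}")))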
#   year executive.name district
# 1 2000        Johnson     1001
# 2 2001          Smith     1001
# 3 2003      Alleghany     1001
# 4 2000        Roberts     1002
# 5 2002         Clarke     1002
# 6 2004        Tollson     1002
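
If you would rather not load extra packages, the same idea can be sketched in base R. This is only a sketch and it assumes, as in the example data, that the first four-digit number in each value is always the inauguration year; `m` and `first_year` are illustrative names, not part of the original question.

```r
# Rebuild the example data from the question. Mixing numbers and strings
# inside c() coerces the whole `year` column to character.
df <- data.frame(
  year = c(2000, "from 2001 to 2002", "01-feb-2003", 2000, "01-jan-2002", "from 2004 to 2005"),
  executive.name = c("Johnson", "Smith", "Alleghany", "Roberts", "Clarke", "Tollson"),
  district = rep(c(1001, 1002), each = 3)
)

# regexpr() locates the first run of four digits in each string
# (it returns -1 where there is no match).
m <- regexpr("[0-9]{4}", df$year)

# regmatches() returns only the matched pieces, so fill them into a
# vector that keeps NA for any row without a four-digit number.
first_year <- rep(NA_character_, nrow(df))
first_year[m != -1] <- regmatches(df$year, m)

df$year <- as.numeric(first_year)
```

Rows that contain no four-digit number end up as `NA` rather than raising an error, which makes unexpected notations easy to spot with `df[is.na(df$year), ]`.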