I need to get the year directly above the null (the earliest file date that has a year value) for each person and then fill in the missing values with that year. I have a method that works using subsetting and assigning values, but it requires looping over the data and with billions of rows this is not feasible. Below there is a mock dataset which shows how the data is structured. Each ID has multiple file dates and may have several years associated with it. For file dates before 3/1/14, the year column is missing.
df <- data.frame(
ID = c(1,1,1,1,1,1,2,2,2,2),
file_date = c("12/1/14", "9/1/14", "6/1/14", "3/1/14", "12/1/13", "9/1/13", "9/1/14", "6/1/14", "3/1/14", "12/1/13"),
year = c(1979, 1965, 1965, 1965, NA, NA, 1982, 1982, 1982, NA)
)
In this data, the first two null values would be filled with 1965 and the last with 1982.
Since you cant use tidyr::fill()
, you can try the following approaches using ave
.
If you have access to the zoo
package, this is quite simple:
# zoo::na.locf
df$Filled_Year_zoo <- ave(df$year, df$ID, FUN = zoo::na.locf0) # thanks to @G. Grothendieck
If not, a completely base R approach could be:
### Full base R
df$Filled_Year_base <- ave(df$year, df$ID,
FUN = \(x) {
x[which(is.na(x))] <- tail(x[which(!is.na(x))], 1)
x})
Result:
# ID file_date year Filled_Year_zoo Filled_Year_base
# 1 1 12/1/14 1979 1979 1979
# 2 1 9/1/14 1965 1965 1965
# 3 1 6/1/14 1965 1965 1965
# 4 1 3/1/14 1965 1965 1965
# 5 1 12/1/13 NA 1965 1965
# 6 1 9/1/13 NA 1965 1965
# 7 2 9/1/14 1982 1982 1982
# 8 2 6/1/14 1982 1982 1982
# 9 2 3/1/14 1982 1982 1982
# 10 2 12/1/13 NA 1982 1982
(you of course can overwrite the year
column, I was just defining new columns for demonstration)