Long format data: Calculate NA for year x as row mean of other years

I have a pretty large longitudinal data set ranging from 2014 to 2021. Most of the variables are available for every year. However, there are a few variables that are available for 2014 and 2016, but not for 2015. In these cases, I want to calculate the value for 2015 as the mean of the values from 2014 and 2016.

The data structure looks as follows. Note that this is extremely simplified; the data set has far more variables and observations. Also, for each respondent there are rows for the other years as well (obviously), which I didn't write down here.

PID	Year	Var 1	Var 2	Var 3
1	2014	10	2	2
1	2015	15	8	NA
1	2016	12	6	4
2	2014	11	7	5
2	2015	16	3	NA
2	2016	14	5	9

PID is the id/number that identifies each respondent. Var1 and Var2 are available for every year; Var3 is only available in 2014 and 2016.

What I want is this:

PID	Year	Var 1	Var 2	Var 3
1	2014	10	2	2
1	2015	15	8	3
1	2016	12	6	4
2	2014	11	7	5
2	2015	16	3	7
2	2016	14	5	9

For Var3, instead of NA, the row for 2015 contains the mean of the value in 2014 and 2016. How can I achieve this?

My first idea was to address the missing values in 2015 with is.na(), but this would address all the NAs in the whole data set and not just the NAs in 2015 for Var3. How can I address these NAs specifically, so that it a) only calculates the value for 2015 as the mean of 2014 and 2016 for Var3 and b) only does so within rows where PID is the same, so that values of different respondents do not get mixed up?

Solution

What you're asking about is imputation: replacing empty/null/NA values, for which there are several different methods. One such method uses a value regressed on one or more other non-empty variables.

I'm actually taking a little liberty with this: you said "mean of 2014 and 2016", which in this case gives the same result as linear interpolation. But if for some reason you have 2014 and 2017 and are missing 2015-2016, then the mean is biased, since it assigns the same value to both 2015 and 2016. Linear interpolation with approx works well here.
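To make the difference concrete, here is a small sketch with made-up numbers (the vectors years and vals are illustrative only, not taken from your data) that compares filling a two-year gap with the mean against filling it with approx:

```r
# Illustrative values only: observed in 2014 and 2017, missing in 2015-2016
years <- 2014:2017
vals  <- c(2, NA, NA, 8)

# Mean of the observed values: both missing years get the same number
by_mean <- ifelse(is.na(vals), mean(vals, na.rm = TRUE), vals)

# Linear interpolation: each missing year gets its own point on the line
by_approx <- approx(years, vals, xout = years)$y

by_mean    # 2 5 5 8
by_approx  # 2 4 6 8
```

With a single missing year between two observed ones, as in your example, the two approaches give the same number.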

dplyr

library(dplyr) # the .by argument used below needs dplyr >= 1.1.0
quux %>%
  mutate(
    across(-Year, ~ coalesce(.x, approx(Year, .x, xout = Year)$y)),
    .by = PID
  )
#   PID Year Var 1 Var 2 Var 3
# 1   1 2014    10     2     2
# 2   1 2015    15     8     3
# 3   1 2016    12     6     4
# 4   2 2014    11     7     5
# 5   2 2015    16     3     7
# 6   2 2016    14     5     9

coalesce is logically the same as ifelse(is.na(.x), approx(..)$y, .x), which means that if it is not NA then the original value will always be used. Frankly, we could likely just do

... across(-Year, ~ approx(Year, .x, xout = Year)$y)

(removing the coalesce, so every value is replaced by its interpolated counterpart), which still gives the desired result here, because approx returns the observed value at every year where one exists. But since I don't know the ins and outs of your data, I thought I would be safer about the process.
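One thing to watch for in real data: approx stops with an error when a group has fewer than two non-NA values for a variable. A minimal sketch of a guard, assuming you would rather leave such groups untouched (interp_na is a hypothetical helper name, not part of any package):

```r
# Hypothetical helper: fill NAs in v by linear interpolation along yr.
# If fewer than two observed values exist, return v unchanged, because
# approx() cannot interpolate in that case.
interp_na <- function(yr, v) {
  ok <- !is.na(v)
  if (sum(ok) < 2L) return(as.numeric(v))
  filled <- approx(yr[ok], v[ok], xout = yr)$y
  ifelse(ok, v, filled)  # keep observed values, fill only the NAs
}

interp_na(2014:2016, c(2, NA, 4))   # 2 3 4
interp_na(2014:2016, c(NA, NA, 4))  # NA NA 4
```

Inside the across call this could be used as ~ interp_na(Year, .x). Note that NAs before the first or after the last observed year stay NA, since approx does not extrapolate by default.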

base R

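# Impute every column except PID and Year (columns 1:2)
quux[,-(1:2)] <- lapply(
  quux[,-(1:2)],
  function(V) with(quux,
    # group the row indices by PID, then interpolate V along Year within each group
    ave(as.numeric(seq_len(nrow(quux))), PID,
        FUN = function(i) ifelse(is.na(V[i]), approx(Year[i], V[i], xout=Year[i])$y, V[i]))
  )
)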
quux
#   PID Year Var 1 Var 2 Var 3
# 1   1 2014    10     2     2
# 2   1 2015    15     8     3
# 3   1 2016    12     6     4
# 4   2 2014    11     7     5
# 5   2 2015    16     3     7
# 6   2 2016    14     5     9

While dplyr and friends have built-in grouping for transformations, in base R we need to use something like ave for non-aggregating by-group calculations.
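If ave is unfamiliar, this small sketch (toy vectors score and grp, made up for illustration) shows what "non-aggregating" means: the result has one value per input row, whereas something like tapply returns one value per group.

```r
# Toy data, for illustration only
score <- c(10, 14, 3, 5, 7)
grp   <- c("a", "a", "b", "b", "b")

# ave(): apply FUN within each group, return a vector as long as the input
centered <- ave(score, grp, FUN = function(x) x - mean(x))
centered  # -2 2 -2 0 2

# tapply(): aggregates, one value per group
group_means <- tapply(score, grp, mean)
group_means  # a = 12, b = 5
```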

I manually exclude the columns we don't want to impute by using -(1:2); it can just as easily be done by specifying which columns you need to impute, such as quux[,3:5] <- lapply(quux[,3:5], ...).
ave only works on one column (vector) at a time, so it's not possible to directly use both Year and (say) Var 1 within it. Because of this, instead of telling ave the data is Var 1, we use row-indices as the grouped variable and use that internally to calculate on the correct indices of Year and the Var # variable.
ave returns the same class as its first argument (regardless of what is returned), so we force the row-indices to be numeric instead of integer. It isn't clear if your data is numeric or integer, so you may be able to drop as.numeric; that's up to you.
We could (without too much difficulty) craft our own coalesce function (similar to dplyr::coalesce, data.table::fcoalesce, and SQL's COALESCE) and replace much of the ifelse statement, but internally it would be doing much the same thing.
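For the last point, a minimal sketch of such a home-made coalesce in base R (coalesce2 is a hypothetical name; it assumes all arguments are vectors of the same length and does none of the type checking that dplyr::coalesce performs):

```r
# Hypothetical base-R coalesce: for each position, take the first non-NA
# value across the supplied vectors, working left to right.
coalesce2 <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

coalesce2(c(1, NA, 3), c(9, 2, 9))  # 1 2 3
```

As said above, this is still ifelse under the hood, so it only tidies the call rather than changing what happens.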

Data

quux <- structure(list(PID = c(1L, 1L, 1L, 2L, 2L, 2L), Year = c(2014L, 2015L, 2016L, 2014L, 2015L, 2016L), "Var 1" = c(10L, 15L, 12L, 11L, 16L, 14L), "Var 2" = c(2L, 8L, 6L, 7L, 3L, 5L), "Var 3" = c(2L, NA, 4L, 5L, NA, 9L)), class = "data.frame", row.names = c(NA, -6L))