Tags: r, dataframe, row, na, long-format-data

Long format data: Calculate NA for year x as row mean of other years


I have a pretty large longitudinal data set ranging from 2014 to 2021. Most of the variables are available for every year. However, there are a few variables that are available for 2014 and 2016, but not for 2015. In these cases, I want to calculate the value for 2015 as the mean of the values from 2014 and 2016.

So the data structure looks as follows. Note this is extremely simplified; the data set has way more variables and observations. Also, for each respondent there are rows for the other years as well (obviously), which I didn't write down here.

PID Year Var 1 Var 2 Var 3
  1 2014    10     2     2
  1 2015    15     8    NA
  1 2016    12     6     4
  2 2014    11     7     5
  2 2015    16     3    NA
  2 2016    14     5     9

PID is the id/number that identifies each respondent. Var 1 and Var 2 are available for every year; Var 3 is only available in 2014 and 2016.

What I want is this:

PID Year Var 1 Var 2 Var 3
  1 2014    10     2     2
  1 2015    15     8     3
  1 2016    12     6     4
  2 2014    11     7     5
  2 2015    16     3     7
  2 2016    14     5     9

For Var 3, instead of NA, the row for 2015 contains the mean of the values from 2014 and 2016. How can I achieve this?

My first idea was to address the missing values in 2015 via is.na(), but this would address all the NAs in the whole data set, not just the NAs in 2015 for Var 3. How can I address these NAs specifically, so that a) the value for 2015 is calculated as the mean of 2014 and 2016 for Var 3, and b) only rows with the same PID are used, so that values of different respondents do not get mixed up?


Solution

  • What you're asking about is imputation, where there are different methods for replacing empty/null/NA values. One such method uses a value regressed from one or more other non-empty variables; here, that amounts to linear interpolation over Year.

    I'm actually taking a little liberty here: you said "mean of 2014 and 2016", and for a single missing year between two known years, linear interpolation gives exactly that mean. But if, say, you have values for 2014 and 2017 and are missing 2015-2016, a plain mean would be biased, assigning the same value to both missing years, whereas linear interpolation fills each year along the line between the known points. So interpolation works well here.
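    As a quick illustration of that difference, a tiny standalone sketch with made-up numbers (not the question's data), assuming a recent R where approx() drops NA pairs by default (na.rm = TRUE):

    yrs  <- 2014:2017
    vals <- c(10, NA, NA, 16)
    # linear interpolation fills each gap year along the line between the known points
    approx(yrs, vals, xout = yrs)$y
    # [1] 10 12 14 16
    # a plain mean of the two known values would assign 13 to both 2015 and 2016
    mean(c(10, 16))
    # [1] 13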

    dplyr

    library(dplyr) # mutate(.by =) requires dplyr >= 1.1.0
    quux %>%
      mutate(
        # within each PID, fill NAs by linear interpolation over Year,
        # keeping non-NA values as-is via coalesce
        across(-Year, ~ coalesce(.x, approx(Year, .x, xout = Year)$y)),
        .by = PID
      )
    #   PID Year Var 1 Var 2 Var 3
    # 1   1 2014    10     2     2
    # 2   1 2015    15     8     3
    # 3   1 2016    12     6     4
    # 4   2 2014    11     7     5
    # 5   2 2015    16     3     7
    # 6   2 2016    14     5     9
    

    coalesce is logically the same as ifelse(is.na(.x), approx(..)$y, .x): whenever a value is not NA, the original value is always kept. Frankly, we could likely just do

    ... across(-Year, ~ approx(Year, .x, xout = Year)$y)
    

    (removing the coalesce and always replacing all values), which still produces the same desired result. But since I don't know the ins and outs of your data, I thought I would be safer about the process.
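    To see the coalesce/ifelse equivalence concretely, here is a tiny standalone example with made-up vectors (not the question's data), where filled stands in for hypothetical interpolated values:

    library(dplyr)
    x      <- c(2, NA, 4)
    filled <- c(2.1, 3, 4.1)  # pretend these came from approx()
    coalesce(x, filled)
    # [1] 2 3 4
    ifelse(is.na(x), filled, x)
    # [1] 2 3 4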

    base R

    # for each non-id column, interpolate NAs within each PID group;
    # ave() returns its results in the original row order, so we can
    # index V and Year by the row numbers it passes to FUN
    quux[, -(1:2)] <- lapply(
      quux[, -(1:2)],
      function(V) with(quux,
        ave(as.numeric(1:nrow(quux)), PID,
            FUN = function(i) ifelse(is.na(V[i]), approx(Year[i], V[i], xout = Year[i])$y, V[i]))
      )
    )
    quux
    #   PID Year Var 1 Var 2 Var 3
    # 1   1 2014    10     2     2
    # 2   1 2015    15     8     3
    # 3   1 2016    12     6     4
    # 4   2 2014    11     7     5
    # 5   2 2015    16     3     7
    # 6   2 2016    14     5     9
    

    While dplyr and friends have built-in grouping for transformations, in base R we need to use something like ave for non-aggregating by-group calculations.
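    A minimal sketch of what ave does here, on toy data (not the question's): it applies FUN within each group and returns the results in the original row order, which is what lets the code above index back into V and Year.

    x <- c(1, 2, 10, 20)
    g <- c("a", "a", "b", "b")
    # center each value on its group mean; results stay in row order
    ave(x, g, FUN = function(v) v - mean(v))
    # [1] -0.5  0.5 -5.0  5.0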


    Data

    quux <- structure(list(
      PID = c(1L, 1L, 1L, 2L, 2L, 2L),
      Year = c(2014L, 2015L, 2016L, 2014L, 2015L, 2016L),
      "Var 1" = c(10L, 15L, 12L, 11L, 16L, 14L),
      "Var 2" = c(2L, 8L, 6L, 7L, 3L, 5L),
      "Var 3" = c(2L, NA, 4L, 5L, NA, 9L)
    ), class = "data.frame", row.names = c(NA, -6L))