I have a fairly large longitudinal data set covering 2014 to 2021. Most of the variables are available for every year. However, a few variables are available for 2014 and 2016 but not for 2015. In these cases, I want to calculate the value for 2015 as the mean of the values from 2014 and 2016.
The data structure looks as follows. Note that this is extremely simplified: the real data set has many more variables and observations. Each respondent also has rows for the other years, which I have omitted here.
PID | Year | Var 1 | Var 2 | Var 3 |
---|---|---|---|---|
1 | 2014 | 10 | 2 | 2 |
1 | 2015 | 15 | 8 | NA |
1 | 2016 | 12 | 6 | 4 |
2 | 2014 | 11 | 7 | 5 |
2 | 2015 | 16 | 3 | NA |
2 | 2016 | 14 | 5 | 9 |
PID is the ID that identifies each respondent. Var 1 and Var 2 are available for every year; Var 3 is only available in 2014 and 2016.
What I want is this:
PID | Year | Var 1 | Var 2 | Var 3 |
---|---|---|---|---|
1 | 2014 | 10 | 2 | 2 |
1 | 2015 | 15 | 8 | 3 |
1 | 2016 | 12 | 6 | 4 |
2 | 2014 | 11 | 7 | 5 |
2 | 2015 | 16 | 3 | 7 |
2 | 2016 | 14 | 5 | 9 |
For Var3, instead of NA, the row for 2015 contains the mean of the value in 2014 and 2016. How can I achieve this?
My first idea was to address the missing values in 2015 with `is.na()`, but this would address all the NAs in the whole data set and not just the NAs in 2015 for Var 3. How can I address these NAs specifically, so that it a) only calculates the value for 2015 as the mean of 2014 and 2016 for Var 3 and b) only does so within rows where PID is the same, so that values of different respondents do not get mixed up?
What you're asking about is imputation, where there are different methods for how to replace empty/null/`NA` values. One such method involves the regressed value based on one or more other non-empty variables.
I'm actually taking a little liberty with this: you said "mean of 2014 and 2016", which in this case gives the same result as linear interpolation. But if for some reason you have 2014 and 2017 and are missing 2015-2016, then the mean will be biased and will be the same value for both 2015 and 2016. Linear interpolation with `approx` works well here.
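To make the point about wider gaps concrete, here is a minimal base R sketch. The values are made up for illustration (they are not from the question's data), and the names `years`, `values`, `endpoint_mean`, and `interpolated` are hypothetical.

```r
# Made-up example: one respondent observed in 2014 and 2017 only
years  <- 2014:2017
values <- c(2, NA, NA, 8)

# Plain mean of the known endpoints: a single value for both missing years
endpoint_mean <- mean(values, na.rm = TRUE)
endpoint_mean
# [1] 5

# Linear interpolation: approx() drops the NA pairs and interpolates
# along the year axis, so each missing year gets its own value
interpolated <- approx(years, values, xout = years)$y
interpolated
# [1] 2 4 6 8
```

When only a single year is missing between two known years, both approaches give the same number, which is why interpolation reproduces the requested "mean of 2014 and 2016". The grouped version for the actual data follows.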
    library(dplyr)
    quux %>%
      mutate(
        across(-Year, ~ coalesce(.x, approx(Year, .x, xout = Year)$y)),
        .by = PID
      )
    #   PID Year Var 1 Var 2 Var 3
    # 1   1 2014    10     2     2
    # 2   1 2015    15     8     3
    # 3   1 2016    12     6     4
    # 4   2 2014    11     7     5
    # 5   2 2015    16     3     7
    # 6   2 2016    14     5     9
`coalesce` is logically the same as `ifelse(is.na(.x), approx(..)$y, .x)`, which means that if a value is not `NA`, the original value is always kept. Frankly, we could likely just do

    ... across(-Year, ~ approx(Year, .x, xout = Year)$y)

(removing the `coalesce` and always replacing all values), which still gives the same desired result. But since I don't know the ins and outs of your data, I thought it safer to keep the original values wherever they exist.
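Since the question asks about Var 3 only, and real data may contain respondents with too few observed values, here is a hedged sketch of a more defensive variant. It assumes dplyr 1.1.0 or later (for `.by`); the helper name `fill_linear` and the third respondent are hypothetical additions for illustration. `approx()` needs at least two non-`NA` points, so the helper returns its input unchanged when fewer are available.

```r
library(dplyr)

# Toy data modelled on the question; PID 3 is a made-up respondent
# with only one observed value, to exercise the guard below
quux <- data.frame(
  PID     = rep(1:3, each = 3),
  Year    = rep(2014:2016, times = 3),
  `Var 3` = c(2, NA, 4, 5, NA, 9, NA, NA, 1),
  check.names = FALSE
)

# Hypothetical helper: linearly interpolate NAs along `year`,
# keeping every value that is already present
fill_linear <- function(year, x) {
  if (sum(!is.na(x)) < 2) return(x)  # approx() needs two known points
  coalesce(x, approx(year, x, xout = year)$y)
}

# Impute only `Var 3`, separately within each respondent
res <- quux %>%
  mutate(`Var 3` = fill_linear(Year, `Var 3`), .by = PID)

res[["Var 3"]]
# [1]  2  3  4  5  7  9 NA NA  1
```

Naming the column explicitly instead of using `across(-Year, ...)` leaves every other variable untouched, which may be preferable when some `NA`s elsewhere in the data should stay as they are.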
A base R alternative to the dplyr code:

    quux[,-(1:2)] <- lapply(
      quux[,-(1:2)],
      function(V) with(quux,
        ave(as.numeric(1:nrow(quux)), PID,
            FUN = function(i) ifelse(is.na(V[i]), approx(Year[i], V[i], xout=Year[i])$y, V[i]))
      )
    )
    quux
    #   PID Year Var 1 Var 2 Var 3
    # 1   1 2014    10     2     2
    # 2   1 2015    15     8     3
    # 3   1 2016    12     6     4
    # 4   2 2014    11     7     5
    # 5   2 2015    16     3     7
    # 6   2 2016    14     5     9
While `dplyr` and friends have built-in grouping for transformations, in base R we need to use something like `ave` for non-aggregating by-group calculations.

- The columns to impute are selected with `-(1:2)` (everything except the first two); it can just as easily be done by specifying which columns you need to impute, such as `quux[,3:5] <- lapply(quux[,3:5], ...)`.
- `ave` only works on one column (vector) at a time, so it's not possible to directly use both `Year` and (say) `Var 1` within it. Because of this, instead of telling `ave` the data is `Var 1`, we use row indices as the grouped variable and use them internally to calculate on the correct indices of `Year` and the `Var #` variable.
- `ave` returns the same class as its first argument (regardless of what is returned), so we force the row indices to be `numeric` instead of `integer`. It isn't clear if your data is `numeric` or `integer`, so you may be able to skip `as.numeric`; over to you.
- We could write a base R `coalesce` function (similar to `dplyr::coalesce`, `data.table::fcoalesce`, and SQL's `COALESCE` functions) and replace much of the `ifelse` statement, but ... internally it would be doing much the same thing.

Data
    quux <- structure(list(
      PID = c(1L, 1L, 1L, 2L, 2L, 2L),
      Year = c(2014L, 2015L, 2016L, 2014L, 2015L, 2016L),
      "Var 1" = c(10L, 15L, 12L, 11L, 16L, 14L),
      "Var 2" = c(2L, 8L, 6L, 7L, 3L, 5L),
      "Var 3" = c(2L, NA, 4L, 5L, NA, 9L)
    ), class = "data.frame", row.names = c(NA, -6L))
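
The note above about a base R coalesce helper could be sketched roughly like this. The name `coalesce2` is hypothetical, it handles only two arguments, and the column is called `Var3` (no space) purely to keep the example short; treat it as an illustration under those assumptions rather than a drop-in replacement.

```r
# Hypothetical two-argument helper: take x where present, else y
coalesce2 <- function(x, y) ifelse(is.na(x), y, x)

# Small stand-in for the data above, with a single column to impute
quux <- data.frame(
  PID  = c(1L, 1L, 1L, 2L, 2L, 2L),
  Year = rep(2014:2016, times = 2),
  Var3 = c(2, NA, 4, 5, NA, 9)
)

# Same row-index trick as in the base R code: ave() groups the row
# indices by PID, and the indices are used to subset Year and Var3
quux$Var3 <- with(quux,
  ave(as.numeric(seq_len(nrow(quux))), PID,
      FUN = function(i) coalesce2(Var3[i], approx(Year[i], Var3[i], xout = Year[i])$y))
)

quux$Var3
# [1] 2 3 4 5 7 9
```

As the note says, this does much the same work as the `ifelse` call; the helper mainly makes the intent ("keep what is there, otherwise interpolate") easier to read.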