r, data-analysis, data-cleaning, measures, imputation

Imputation for longitudinal data using observation before and after missing data


I’m in the process of cleaning some longitudinal data and I have several missing cases. I am trying to use an imputation that incorporates observations before and after the missing case. I’m wondering how I can go about addressing the issues detailed below.

I’ve been trying to break the problem into smaller, more manageable operations and objects. However, every solution I come up with forces me to use conditional logic based on the rows immediately above and below a missing value, and, quite frankly, I’m at a loss as to how to do this. I would appreciate some guidance if you know of a good technique I can experiment with, or of any good search terms I can use when looking for a solution.

The details are below:

#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)


The goal is to replace the NA value for ID #1 (variable ss) with the mean of the value before it (3) and the value after it (0), so that the data look like this:
1,3,2,3,1.5,0,0

ID #2 (variable ss) follows the same rule (the mean of 0 and 0) and should look like this:
2,4,0,0,0,0,0

ID #3 (variable ss) is missing its last observation, so it should use a last-observation-carried-forward approach and look like this:
4,1,2,4,2,3,3

ID #4 (variable ss) has two consecutive NA values and should not be changed. It will be flagged for a different analysis later in my project. So, it should look like this:
2,1,0,NA,NA,0,0 (no change).


Solution

  • I use the smwrBase package. The syntax for filling in gaps of only one missing value is below, but it doesn't account for id.

    smwrBase::fillMissing(ss, max.fill=1)
    

    The zoo package is probably more standard, but it has the same limitation.

    zoo::na.approx(ss, maxgap=1)
    

    Below is an approach that accounts for the variable id. Interpolation functions such as na.approx don't fill in a trailing missing value, so I added a manual if statement for that case. It is a bit brute force, and there might be a tapply approach out there.

    > id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
    > time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
    > ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
    > mydat <- data.frame(id, time, ss, ss2=NA_real_)
    > for (i in unique(id)) {
    +   # interpolate for gaps
    +   mydat$ss2[mydat$id==i] <- zoo::na.approx(ss[mydat$id==i], maxgap=1, na.rm=FALSE)
    +   # extension for gap as last value
    +   if(is.na(mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])])) {
    +     mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])] <-
    +       mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])-1]
    +   }
    + }
    > mydat
       id time ss ss2
    1   1    0  1 1.0
    2   1    1  3 3.0
    3   1    2  2 2.0
    4   1    3  3 3.0
    5   1    4 NA 1.5
    6   1    5  0 0.0
    7   1    6  0 0.0
    8   2    0  2 2.0
    9   2    1  4 4.0
    10  2    2  0 0.0
    11  2    3 NA 0.0
    12  2    4  0 0.0
    13  2    5  0 0.0
    14  2    6  0 0.0
    15  3    0  4 4.0
    16  3    1  1 1.0
    17  3    2  2 2.0
    18  3    3  4 4.0
    19  3    4  2 2.0
    20  3    5  3 3.0
    21  3    6 NA 3.0
    22  4    0  2 2.0
    23  4    1  1 1.0
    24  4    2  0 0.0
    25  4    3 NA  NA
    26  4    4 NA  NA
    27  4    5  0 0.0
    28  4    6  0 0.0
    

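    The interpolated value in id=1 is 1.5 (the average of 3 and 0), in id=2 it is 0 (the average of 0 and 0), and in id=3 it is 3 (the preceding value, since there is no following value). The two consecutive NA values in id=4 are left unchanged because the gap is longer than maxgap=1.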
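For what it's worth, the same rules can also be sketched in base R with a grouped call instead of the explicit loop. This is only a minimal sketch, not a tested replacement for the zoo approach: fill_single_gaps is a hypothetical helper name made up for illustration, and it assumes the rows are already sorted by time within each id and that the time points are equally spaced (so the mean of the two neighbours is the same as linear interpolation).

```r
# Example data from the question
id   <- rep(1:4, each = 7)
time <- rep(0:6, times = 4)
ss   <- c(1,3,2,3,NA,0,0, 2,4,0,NA,0,0,0, 4,1,2,4,2,3,NA, 2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

# Hypothetical helper (not from any package): fills an isolated NA with the
# mean of its two neighbours, carries the previous value forward when the NA
# is the last observation, and leaves every other NA untouched.
fill_single_gaps <- function(x) {
  n <- length(x)
  out <- x
  for (k in which(is.na(x))) {
    prev_ok <- k > 1 && !is.na(x[k - 1])
    next_ok <- k < n && !is.na(x[k + 1])
    if (prev_ok && next_ok) {
      # single interior gap: mean of the observed neighbours
      out[k] <- (x[k - 1] + x[k + 1]) / 2
    } else if (k == n && prev_ok) {
      # trailing gap: last observation carried forward
      out[k] <- x[k - 1]
    }
  }
  out
}

# ave() applies the helper within each id and returns a vector in row order
mydat$ss2 <- ave(mydat$ss, mydat$id, FUN = fill_single_gaps)
mydat
```

Because the helper only ever looks at the original neighbours, runs of two or more NA values (as in id=4) stay NA and can still be flagged later.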