I’m in the process of cleaning some longitudinal data and I have several missing cases. I am trying to use an imputation that incorporates observations before and after the missing case. I’m wondering how I can go about addressing the issues detailed below.
I’ve been trying to break the problem apart into smaller, more manageable operations and objects, however, the solutions I keep coming to force me to use conditional formatting based on rows immediately above and below the a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you think you know of a good technique I can use, experiment with, or if you know of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
*Bold characters represent changes from the dataset above
The goal here is to find a way to get the mean of the value before (3) and after (0) the NA value for ID #1 (variable ss) so that the data look like this:
1,3,2,3,1.5,0,0,
ID# 2 (variable ss) should look like this:
2,4,0,0,0,0,0
ID #3 (variable ss) should use a last observation carried forward approach, so it would need to look like this:
4,1,2,4,2,3,3
ID #4 (variable ss) has two consecutive NA values and should not be changed. It will be flagged for a different analysis later in my project. So, it should look like this:
2,1,0,NA,NA,0,0 (no change).
I use a package, smwrBase, the syntax for only filling in 1 missing value is below, but doesn't address id.
smwrBase::fillMissing(ss, max.fill=1)
The zoo package might be more standard, same issue though.
zoo::na.approx(ss, maxgap=1)
Below is an approach that accounts for the variable id. Current interpolation approaches dont like to fill in the last value, so i added a manual if stmt for that. A bit brute force as there might be a tapply approach out there.
> id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
> time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
> ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
> mydat <- data.frame(id, time, ss, ss2=NA_real_)
> for (i in unique(id)) {
+ # interpolate for gaps
+ mydat$ss2[mydat$id==i] <- zoo::na.approx(ss[mydat$id==i], maxgap=1, na.rm=FALSE)
+ # extension for gap as last value
+ if(is.na(mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])])) {
+ mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])] <-
+ mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])-1]
+ }
+ }
> mydat
id time ss ss2
1 1 0 1 1.0
2 1 1 3 3.0
3 1 2 2 2.0
4 1 3 3 3.0
5 1 4 NA 1.5
6 1 5 0 0.0
7 1 6 0 0.0
8 2 0 2 2.0
9 2 1 4 4.0
10 2 2 0 0.0
11 2 3 NA 0.0
12 2 4 0 0.0
13 2 5 0 0.0
14 2 6 0 0.0
15 3 0 4 4.0
16 3 1 1 1.0
17 3 2 2 2.0
18 3 3 4 4.0
19 3 4 2 2.0
20 3 5 3 3.0
21 3 6 NA 3.0
22 4 0 2 2.0
23 4 1 1 1.0
24 4 2 0 0.0
25 4 3 NA NA
26 4 4 NA NA
27 4 5 0 0.0
28 4 6 0 0.0
The interpolated value in id=1 is 1.5 (avg of 3 and 0), id=2 is 0 (avg of 0 and 0, and id=3 is 3 (the value preceding since it there is no following value).