r

How do I use the lubridate package to calculate the number of months between two date vectors where one of the vectors has NA values?


I've read the lubridate package manual and have queried Stack Overflow with a variety of permutations of my question but have come up with no answer to my specific problem.

What I'm trying to do is calculate age in months at time of event as the difference between date of birth and some specific event date.

As such, I imported a SAS dataset using the sas7bdat package and converted my SAS date variables (DOB and Event) to R objects using the following code:

df$DOB <- as.Date(df$DOB, origin="1960-01-01")
df$DOB1 <- ymd(df$DOB)

And same thing for the Event variable:

df$Event <- as.Date(df$Event, origin="1960-01-01")
df$Event1 <- ymd(df$Event)

However, there are some NA values for DOB. So, for the following code which I want to use to calculate age (in months).

df$interval <- new_interval(df$DOB1,df$Event1)
df$Age1 <- df$interval %/% months(1)

I'm receiving the error:

Error in est[start + est * per < end] <- est[start + est * per < end] + : NAs are not allowed in subscripted assignments

What am I doing wrong? I've tried an if/else function but perhaps used it incorrectly.

(Note: For the SAS programmers out there, I'm trying to produce the same results as the following function:

IF DOB ne . THEN Tage=Floor(intck('month',DOB,Event)-(Day(Event)<Day(DOB)));

Solution

  • Well so I give all credit for this answer to my talented work colleague. I neglected to include a reproducible example because whenever I would write a simple approximation of my problem, the df$Age1 <- df$interval %/% months(1) always worked! This left me totally stumped. It wasn't until I actually ran the code on my dataframe of 650,000+ birthdates and event dates that the error message...

    Error in est[start + est * per < end] <- est[start + est * per < end] + : NAs are not allowed in subscripted assignments

    ... would even come up! My colleague had the idea to process this calculation iteratively with the following function:

    df$Age1 = rep(NA, nrow(df))
    for (i in 1:nrow(df)) {
       df$Age1[i]<- df$interval[i] %/% months(1)
                          }
    df$Age1[1:15]
    

    Using my dataframe, it became plain to see that this calculation got hung up on row 13!

    > df$interval[13]
    [1] 1995-10-31 19:00:00 EST--1996-05-26 20:00:00 EDT
    

    So we aren't certain, but maybe the fact that the df$DOB[13] is 10/31 is screwing it up. This sort of problem with the lubridate package has been reported before (i.e., lubridate not being able to divide intervals by a period when one of the dates is at the end of the month):

    https://github.com/hadley/lubridate/issues/235

    The way we came to a solution was by using as.period and then converting it to months:

    df$Age1<- as.period(df$interval)
    head(df$Age1)
    
    [1] "1y 2m 26d 0H 0M 0S" "6m 15d 23H 0M 0S"  
    [3] "4m 9d 23H 0M 0S"    "3m 19d 23H 0M 0S"  
    [5] "3y 0m 25d 0H 0M 0S" "1y 1m 29d 1H 0M 0S"
    
    df$Age1 <- df$Age1 %/% months(1)
    head(df$Age1)
    
    [1] 14  6  4  3 36 13