r date survival-analysis cox-regression survival

tmerge() + coxph(): two ways of setting up dates should give same results, and don't

I'm using tmerge() to set up data for Cox regression with time-varying covariates. Two ways of expressing the times should, I think, give the same regression results, but they don't.

One way uses start and end dates, and converts to numeric within Surv(); the other just uses numeric days to event.

Example

First, create some data. We have an ID, an outcome (death), a start date for each row, and an end date some time later. The start date and end date are Date objects.

library(survival)  # provides tmerge(), coxph(), and Surv()
n <- 1000
set.seed(0)
dd <- data.frame(id=1:n, 
  death=sample(x=c(FALSE, TRUE), prob=c(8, 1), size=n, replace=TRUE), 
  startDate=as.Date(runif(min=as.numeric(as.Date("2000-01-01")), 
    max=as.numeric(as.Date("2019-12-31")), n=n), origin="1970-01-01"))
dd$endDate <- as.Date(as.numeric(dd$startDate) + 
    rnorm(mean=3650, sd=500, n=n), origin="1970-01-01")
# Check that endDate is never before startDate.
stopifnot(all(dd$endDate > dd$startDate))

Instead of a start and end date for each participant, we could start each person's clock at zero and record the number of days until the event or censoring:

dd$startDay <- 0
dd$endDay <- as.numeric(dd$endDate - dd$startDate)

Next, we use tmerge() to transform the data into the format that would be needed for Cox regression with time-varying covariates. (Note: this is a minimal example that does not actually have any time-varying covariates.)

We do this two ways to compare: 1) using numeric days to event/censoring; 2) using dates.

Using days

ddTv <- tmerge(data1=dd, data2=dd, id=id, 
  tstart=startDay, tstop=endDay, event=event(endDay, death))
ddTv[1:6, ]

id death  startDate    endDate startDay endDay tstart  tstop event
 1  TRUE 2005-04-23 2014-11-29        0 3506.6      0 3506.6  TRUE
 2 FALSE 2010-08-13 2023-02-16        0 4570.6      0 4570.6 FALSE
 3 FALSE 2013-09-11 2023-06-22        0 3571.6      0 3571.6 FALSE
 4 FALSE 2007-08-31 2015-10-03        0 2955.1      0 2955.1 FALSE
 5  TRUE 2019-02-05 2027-01-27        0 2913.4      0 2913.4  TRUE
 6 FALSE 2002-05-14 2012-04-06        0 3615.2      0 3615.2 FALSE

Using dates

ddTvDate <- tmerge(data1=dd, data2=dd, id=id, 
  tstart=startDate, tstop=endDate, event=event(endDate, death))
ddTvDate[1:6, ]

id death  startDate    endDate startDay endDay     tstart      tstop event
 1  TRUE 2005-04-23 2014-11-29        0 3506.6 2005-04-23 2014-11-29  TRUE
 2 FALSE 2010-08-13 2023-02-16        0 4570.6 2010-08-13 2023-02-16 FALSE
 3 FALSE 2013-09-11 2023-06-22        0 3571.6 2013-09-11 2023-06-22 FALSE
 4 FALSE 2007-08-31 2015-10-03        0 2955.1 2007-08-31 2015-10-03 FALSE
 5  TRUE 2019-02-05 2027-01-27        0 2913.4 2019-02-05 2027-01-27  TRUE
 6 FALSE 2002-05-14 2012-04-06        0 3615.2 2002-05-14 2012-04-06 FALSE

These two ways of expressing the same data don't give the same regression results. We'll compare just the null model:

Using days

ddMod <- coxph(formula=
    Surv(time=tstart, time2=tstop, event=death) ~ 1, 
  data=ddTv)
ddMod

Null model
  log likelihood= -702.08 
  n= 1000

Using dates

ddModDate <- coxph(formula=
    Surv(time=as.numeric(tstart), time2=as.numeric(tstop), event=death) ~ 1, 
  data=ddTvDate)
ddModDate

Null model
  log likelihood= -681.85 
  n= 1000

Log-likelihoods are similar, but not the same.

Why are these not the same?

If you add covariates to the model, the coefficients and p-values also differ between the two versions.

Finally, if you skip tmerge() and call coxph() directly on the original dataset, both methods give the same results. Both of these models

ddMod2 <- coxph(formula=
    Surv(time=endDay, event=death) ~ 1, 
  data=dd)
ddMod2

ddModDate2 <- coxph(formula=
    Surv(time=as.numeric(endDate - startDate), event=death) ~ 1, 
  data=dd)
ddModDate2

give the same results as ddMod above, the version using days.

Solution

Professor Terry M. Therneau (creator of the survival package) kindly gave me an answer, with permission to post here.

Paraphrasing:

Basically, the results are different because those are two entirely different models. Consider, for example, a participant who had an event on the 100th day that they were in the study, on January 1, 2010.

If I use calendar dates for my times, then the risk set for that event is everyone who was in the study on January 1, 2010.
If I use time since entry for my times, then the risk set for that event is everyone who was still in the study on their 100th day since entry.

Those are probably very different sets of people!
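
To make the two risk sets concrete, here is a minimal sketch in base R that rebuilds the example data from the question and finds who is at risk at the time of one participant's death under each time scale. The names riskCalendar and riskEntry are illustrative only, and the sketch assumes the (start, stop] interval convention that coxph() uses for counting-process data.

```r
# Recreate the example data from the question (same seed, same calls).
n <- 1000
set.seed(0)
dd <- data.frame(id=1:n,
  death=sample(x=c(FALSE, TRUE), prob=c(8, 1), size=n, replace=TRUE),
  startDate=as.Date(runif(min=as.numeric(as.Date("2000-01-01")),
    max=as.numeric(as.Date("2019-12-31")), n=n), origin="1970-01-01"))
dd$endDate <- as.Date(as.numeric(dd$startDate) +
    rnorm(mean=3650, sd=500, n=n), origin="1970-01-01")
dd$endDay <- as.numeric(dd$endDate - dd$startDate)

# Pick the first participant who died.
i <- which(dd$death)[1]

# Calendar time scale: at risk = entered before the event date
# and still under follow-up on that date, i.e. start < t <= stop.
eventDate <- as.numeric(dd$endDate[i])
riskCalendar <- dd$id[as.numeric(dd$startDate) < eventDate &
                      as.numeric(dd$endDate) >= eventDate]

# Time since entry: everyone starts at 0, so a participant is at risk
# if their follow-up lasted at least as long as the event time.
eventDay <- dd$endDay[i]
riskEntry <- dd$id[dd$endDay >= eventDay]

# Compare the sizes of the two risk sets and their overlap.
length(riskCalendar)
length(riskEntry)
length(intersect(riskCalendar, riskEntry))
```

The participant who died belongs to both risk sets, but the rest of the membership differs, which is why the partial likelihoods (and hence the fitted models) differ.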

For almost every study, time since entry is the measure you want.
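
As a rough check that the time scale, rather than tmerge() itself, drives the difference, the sketch below fits the null model directly on the original data in three ways. It assumes the survival package is installed; the model names (modCalendar, modEntry, modEntryCounting) are illustrative and not from the original post.

```r
library(survival)

# Recreate the example data from the question (same seed, same calls).
n <- 1000
set.seed(0)
dd <- data.frame(id=1:n,
  death=sample(x=c(FALSE, TRUE), prob=c(8, 1), size=n, replace=TRUE),
  startDate=as.Date(runif(min=as.numeric(as.Date("2000-01-01")),
    max=as.numeric(as.Date("2019-12-31")), n=n), origin="1970-01-01"))
dd$endDate <- as.Date(as.numeric(dd$startDate) +
    rnorm(mean=3650, sd=500, n=n), origin="1970-01-01")
dd$startDay <- 0
dd$endDay <- as.numeric(dd$endDate - dd$startDate)

# Calendar time scale: counting-process form with raw dates as numbers.
modCalendar <- coxph(
  Surv(as.numeric(startDate), as.numeric(endDate), death) ~ 1, data=dd)

# Time since entry, ordinary right-censored form.
modEntry <- coxph(Surv(endDay, death) ~ 1, data=dd)

# Time since entry, counting-process form with every start at 0.
modEntryCounting <- coxph(Surv(startDay, endDay, death) ~ 1, data=dd)

# Log-likelihoods of the three null models, side by side.
c(calendar=modCalendar$loglik,
  entry=modEntry$loglik,
  entryCounting=modEntryCounting$loglik)
```

If the explanation above is right, the two time-since-entry fits should agree with each other, while the calendar-date fit should differ from both. When real time-varying covariates are recorded as dates, one option is to subtract each participant's start date before passing the times to tmerge(), so that everything is on the time-since-entry scale.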

Obvious once he points it out, opaque to me until then.