I am building a Cox PH model using the survival package in R and would like to include a time-dependent coefficient for my categorical variable. Reproducible data set up:
library(survival)
# Data
stanford <- stanford2
stanford$age_cat <- ifelse(stanford$age > 35, "old", "young")
Working from the time-dependent vignette here for the survival package, I need to use the `tt()` function. Attempt 1 revealed that I needed dummy coding.
mod.fail <- coxph(Surv(time, status) ~ tt(age_cat),
data = stanford,
tt = function(x, t, ...) x*t)
Error in x * t : non-numeric argument to binary operator
So, add this indicator variable.
# Create dummy coding of age_cat
stanford$age_cat_d <- ifelse(stanford$age_cat == "old", 1, 0)
Now, I am confused how to properly specify the model. Both of the below will run, but I am not sure which provides the right solution to letting the effect of the age category vary over time.
# Model 1
mod.t1 <- coxph(Surv(time, status) ~ tt(age_cat_d),
data = stanford,
tt = function(x, t, ...) x*t)
# Model 2
mod.t2 <- coxph(Surv(time, status) ~ age_cat_d + tt(age_cat_d),
data = stanford,
tt = function(x, t, ...) x*t)
Below is how I would think we should estimate the effect of the age category at time=200 in each model, showing the models are different.
# Model 1
coef(mod.t1)[1]*200
tt(age_cat_d)
0.04425679
# Model 2
coef(mod.t2)[1]+coef(mod.t2)[2]*200
age_cat_d
0.5424105
So, is either of the above models the correct way to implement a time-dependent coefficient for the age category? The examples in the linked vignette (and other guides for using `tt()` I've found) focus on time-dependent coefficients for continuous variables. (Note: The above example is just for reproducibility; I am not arguing we should create such a time-dependent model for the given data.)
[1]: https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf
`tt()` declares the transformation for time-varying coefficients regardless of whether your covariate is continuous or discrete, so this is really a question of understanding what model you are fitting when you drop the "main term" from a time-varying coefficient Cox model, and how to interpret the parameter estimates.
The easiest way to answer this is probably to go through different model specifications (via syntax) and explain what they're doing.
library(survival)
# Data
stanford <- stanford2
stanford$age_b <- ifelse(stanford$age > 35, 1, 0) # add binary covariate
# Function computing time functional form
myfun <- function(x, t, ...){ x * log(t + 20)}
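To make the role of the transformation concrete, the following minimal sketch evaluates the same function by hand for young (0) and old (1) subjects at a couple of times. It assumes, as in the vignette's examples, that `x` holds the covariate values and `t` the corresponding event times; the input vectors are made up for illustration.

```r
# Same time functional form as above, repeated so the block runs on its own
myfun <- function(x, t, ...) { x * log(t + 20) }

# Hypothetical inputs: covariate values (0 = young, 1 = old) and event times
x <- c(0, 1, 0, 1)
t <- c(10, 10, 200, 200)

# Transformed covariate: 0 for the young, log(t + 20) for the old
z <- myfun(x, t)
z
```

For a 0/1 covariate the transformed term is therefore just the time function switched on for the "old" group, which is why the same `tt()` machinery applies to binary and continuous covariates alike.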
### Model 1 Continuous: continuous covariate, time-invariant coefficient
Age has one time-invariant effect.
coxph(Surv(time, status) ~ age, data = stanford)
#> Call:
#> coxph(formula = Surv(time, status) ~ age, data = stanford)
#>
#> coef exp(coef) se(coef) z p
#> age 0.02917 1.02960 0.01064 2.741 0.00613
#>
#> Likelihood ratio test=8.27 on 1 df, p=0.004034
#> n= 184, number of events= 113
### Model 2 Continuous: continuous covariate, adding time-varying coefficient
The effect of age now also varies across time. The total effect of age is decomposed into a time-invariant term (the coefficient on `age`) and a time-varying term (the coefficient on `tt(age)`). The total effect of age in this example is -.007 + .007*log(t+20) based on the function used for `tt()`. This interpretation is provided in the time-varying coefficient vignette.
coxph(Surv(time, status) ~ age + tt(age),
tt = myfun,
data = stanford)
#> Call:
#> coxph(formula = Surv(time, status) ~ age + tt(age), data = stanford,
#> tt = myfun)
#>
#> coef exp(coef) se(coef) z p
#> age -0.007256 0.992770 0.042434 -0.171 0.864
#> tt(age) 0.007182 1.007208 0.008190 0.877 0.381
#>
#> Likelihood ratio test=9.04 on 2 df, p=0.01086
#> n= 184, number of events= 113
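As a rough sketch of how to read this fit, the total log hazard ratio per year of age can be evaluated at a few time points by plugging the printed coefficients into the formula above. The coefficients are copied from the output; the helper name `beta_age` and the chosen times are illustrative only.

```r
# Coefficients copied from the printed output for age + tt(age)
b_main <- -0.007256  # coefficient on age
b_tt   <-  0.007182  # coefficient on tt(age)

# Total effect of a 1-year increase in age at time t (log hazard scale)
beta_age <- function(t) b_main + b_tt * log(t + 20)

times <- c(10, 100, 200, 1000)  # illustrative time points
data.frame(time   = times,
           log_hr = beta_age(times),
           hr     = exp(beta_age(times)))
```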
### Model 3 Continuous: continuous covariate, time-varying coefficient only
Similar to Model 2, we're letting the effect of age vary with time. However, we are no longer separately estimating the time-varying component and the time-invariant component. Instead, we're directly estimating the total effect of age, which can vary across time. The total effect of age is .006*log(t+20).
coxph(Surv(time, status) ~ tt(age),
tt = myfun,
data = stanford)
#> Call:
#> coxph(formula = Surv(time, status) ~ tt(age), data = stanford,
#> tt = myfun)
#>
#> coef exp(coef) se(coef) z p
#> tt(age) 0.005829 1.005846 0.002046 2.849 0.00439
#>
#> Likelihood ratio test=9.02 on 1 df, p=0.002677
#> n= 184, number of events= 113
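One way to see what dropping the main term does: with only `tt(age)` in the model, the effect of age is constrained to be proportional to log(t+20), whereas the model with both terms adds a free offset. The sketch below uses coefficients copied from the two printed fits (the function names are illustrative) and compares the implied effects at a few times.

```r
# Coefficients copied from the printed fits above
effect_m2 <- function(t) -0.007256 + 0.007182 * log(t + 20)  # age + tt(age)
effect_m3 <- function(t) 0.005829 * log(t + 20)              # tt(age) only

times <- c(10, 100, 200, 1000)  # illustrative time points
cbind(time       = times,
      model2     = effect_m2(times),
      model3     = effect_m3(times),
      # constant column: the tt-only effect is a fixed multiple of log(t + 20)
      ratio_to_g = effect_m3(times) / log(times + 20))
```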
Now let's try to fit these models with a binary covariate instead of a continuous one. The coefficient estimates change, but they still represent the same concepts with respect to time.
### Model 1 Binary: binary covariate, time-invariant coefficient
Same as Model 1 Continuous: age has one time-invariant effect. Now instead of that effect being the effect of a 1-unit change in continuous age, it's the effect of being old rather than young.
coxph(Surv(time, status) ~ age_b, data = stanford)
#> Call:
#> coxph(formula = Surv(time, status) ~ age_b, data = stanford)
#>
#> coef exp(coef) se(coef) z p
#> age_b 0.2721 1.3128 0.2304 1.181 0.238
#>
#> Likelihood ratio test=1.47 on 1 df, p=0.2258
#> n= 184, number of events= 113
### Model 2 Binary: binary covariate, adding time-varying coefficient
Same as Model 2 Continuous: the effect of age now also varies across time. The total effect of age is decomposed into a time-invariant term (the coefficient on `age_b`) and a time-varying term (the coefficient on `tt(age_b)`). The total effect of age in this example is .025 + .047*log(t+20) based on the function used for `tt()`. That is the effect of being old rather than young.
```r
coxph(Surv(time, status) ~ age_b + tt(age_b),
tt = myfun,
data = stanford)
#> Call:
#> coxph(formula = Surv(time, status) ~ age_b + tt(age_b), data = stanford,
#> tt = myfun)
#>
#> coef exp(coef) se(coef) z p
#> age_b 0.02475 1.02506 0.92143 0.027 0.979
#> tt(age_b) 0.04680 1.04791 0.16956 0.276 0.783
#>
#> Likelihood ratio test=1.54 on 2 df, p=0.4621
#> n= 184, number of events= 113
```
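### Model 3 Binary: binary covariate, time-varying coefficient only
Once again, we are now estimating the total time-varying effect of being old vs. young rather than decomposing the total effect into time-varying and time-invariant components. The total effect of being old rather than young is .051*log(t+20).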
coxph(Surv(time, status) ~ tt(age_b),
tt = myfun,
data = stanford)
#> Call:
#> coxph(formula = Surv(time, status) ~ tt(age_b), data = stanford,
#> tt = myfun)
#>
#> coef exp(coef) se(coef) z p
#> tt(age_b) 0.05121 1.05255 0.04239 1.208 0.227
#>
#> Likelihood ratio test=1.54 on 1 df, p=0.2142
#> n= 184, number of events= 113
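To connect this back to the question's calculation at time 200, here is a small sketch that evaluates the effect of being old rather than young under both binary specifications, using coefficients copied from the printed output. Note that it uses the log(t+20) form from `myfun`, not the `x*t` form in the question, so the numbers are not comparable to the question's; the function and variable names are illustrative.

```r
g <- function(t) log(t + 20)  # time function used in myfun

# Binary covariate with main term: age_b + tt(age_b)
effect_b2 <- function(t) 0.02475 + 0.04680 * g(t)
# Binary covariate without main term: tt(age_b) only
effect_b3 <- function(t) 0.05121 * g(t)

t0 <- 200
log_hr <- c(with_main = effect_b2(t0), tt_only = effect_b3(t0))
log_hr       # log hazard ratios, old vs. young, at time t0
exp(log_hr)  # corresponding hazard ratios
```

Both specifications are time-varying coefficient models; the one without the main term constrains the effect to be proportional to the chosen time function. With the question's `x*t` form, that constraint means the effect is forced to be zero at time 0.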