Good afternoon,
I could post reproducible code and certainly will if everyone agrees that something is wrong, but right now I think my question is quite simple and someone will point me the right path.
I am working in a data set like this:
created_as_free_user t c
<fctr> <int> <int>
1 true 36 0
2 true 36 0
3 true 0 1
4 true 28 0
5 true 9 0
6 true 0 1
7 true 13 0
8 true 19 0
9 true 9 0
10 true 16 0
I fitted a Cox Regression model like this:
fit_train = coxph(Surv(time = t,event = c) ~ created_as_free_user ,data = teste)
summary(fit_train)
And received:
Call:
coxph(formula = Surv(time = t, event = c) ~ created_as_free_user,
data = teste)
n= 9000, number of events= 1233
coef exp(coef) se(coef) z Pr(>|z|)
created_as_free_usertrue -0.7205 0.4865 0.1628 -4.426 9.59e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
created_as_free_usertrue 0.4865 2.055 0.3536 0.6693
Concordance= 0.511 (se = 0.002 )
Rsquare= 0.002 (max possible= 0.908 )
Likelihood ratio test= 15.81 on 1 df, p=7e-05
Wald test = 19.59 on 1 df, p=9.589e-06
Score (logrank) test = 20.45 on 1 df, p=6.109e-06
So far so good. Next step: Predict the results on new data. I understand the different types of predictions that predict.coxph can give me (or at least I think I do). Let's use the type = "lp":
head(predict(fit_train,validacao,type = "lp"),n=20)
And get:
1 2 3 4 5 6 7 8 9 10
-0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854
11 12 13 14 15 16 17 18 19 20
-0.01208854 -0.01208854 0.70842049 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854
OK. But when I look at the data that I am trying to estimate:
# A tibble: 9,000 × 3
created_as_free_user t c
<fctr> <int> <int>
1 true 20 0
2 true 12 0
3 true 0 1
4 true 10 0
5 true 51 0
6 true 36 0
7 true 44 0
8 true 0 1
9 true 27 0
10 true 6 0
# ... with 8,990 more rows
It makes me confuse....
The type = "lp" isn't suppose to give you the linear predictions? For this data above that I am trying to estimate, since the created_as_free_user variable is equal to true, Am I wrong expecting the type = "lp" prediction to be exactaly -0.7205 (the coefficient of the model above)? Where does the -0.01208854 came? I suspect it's some sort of scale situation, but couldn't find the answer online.
My final objective is the h(t) that is given by the prediction type = "expected", but I am not all that comfortable using it because it uses this -0.01208854 value that I don't fully understand.
Thanks a lot
The Details section in ?predict.coxph
reads:
The Cox model is a relative risk model; predictions of type "linear predictor", "risk", and "terms" are all relative to the sample from which they came. By default, the reference value for each of these is the mean covariate within strata.
To illustrate what this means, we can look at a simple example. Some fake data:
test1 <- list(time=c(4,3,1,1,1),
status=c(1,1,1,0,0),
x=c(0,2,1,1,0))
We fit a model and view predictions:
fit <- coxph(Surv(time, status) ~ x, test1)
predict(fit, type = "lp")
# [1] -0.6976630 1.0464945 0.1744157 0.1744157 -0.6976630
The predictions are the same as:
(test1$x - mean(test1$x)) * coef(fit)
# [1] -0.6976630 1.0464945 0.1744157 0.1744157 -0.6976630
(Using this logic and some arithmetic we can back out from your results that you have 8849 "trues" out of 9000 observations for your created_as_free_user
variable.)