I am training a Conditional logistic regression model with clogit
on multiple datasets (with thousands of events) in the following way:
library(survival)
library(mgcv)
# load dataset
df <- read.csv('1.csv')
model <- clogit(case ~
var1 +
# pspline(var2, df = 3) +
strata(var3),
data = df)
print(model)
summary(model)
The column types are: case: int, var1: factor, var2: int, var3: int.
If I keep this line commented: pspline(var2, df = 3) +
, the summary printing works fine when the dataset has enough cases in each strata, otherwise I get the following warning and very large standard errors:
Warning message in fitter(X, Y, strats, offset, init, control, weights = weights, :
“Loglik converged before variable 1,2,3 ; beta may be infinite. ”
However, if I use the line pspline(var2, df = 3) +
, then I do not get such warnings even when the dataset does not have enough cases in each strata. The print(model)
line works, but I get the following error when I try to access the summary
of the model:
Error in pchisq(cox$score, df, lower.tail = FALSE): Non-numeric argument to mathematical function
Traceback:
1. print(summary(model))
2. summary(model)
3. summary.coxph(model)
4. pchisq(cox$score, df, lower.tail = FALSE)
I need to access the summary because I am printing the coefficients to a csv file for later processing: summary(model)$coefficients
since I am training the models on multiple files.
I could not find a reason for this behavior, and any help would be appreciated.
Edit: 06.26 Minimum reproducible example:
num_cases = 100
var3 = rep((1:num_cases), each=3)
case = rep(c(0, 1, 1), num_cases)
var1 = factor(sample(c("Low", "Medium", "High"), num_cases, replace=TRUE, prob = c(0.5,0.35,0.25)))
var2 = runif(num_cases * 3, 10, 35)
generated_data <- data.frame(var3, case, var1, var2)
model <- clogit(case ~
var1 +
pspline(var2, df = 3) +
strata(var3),
data = generated_data)
print(model)
summary(model)$coefficients
Result:
Adding a comma after case ~ var1
does not produce the error. The code now prints coefficients, but the coefficients it returns are different to the ones returned when I remove the comma and use print(model)
.
The above code fails to converge when num_cases = 200
.
Adding the comma after case ~ var1
produces another warning:
Warning message in clogit(case ~ var1, +pspline(var2, df = 3) + strata(var3), data = generated_data): “weights ignored: not possible for the exact method”
What's happening is that an object of class 'clogit' is being passed by inheritance to summary.cph
which is apparently not designed for it. You can get the coefficients from the print.clogit
function, which is what is being implicitly called when you ask for the models results:
model
Call:
clogit(case ~ var1 + pspline(var2, df = 3) + strata(var3), data = generated_data)
coef exp(coef) se(coef) z p
var1Low -0.02063 0.97958 0.23019 -0.090 0.929
var1Medium -0.02171 0.97852 0.23831 -0.091 0.927
ps(var2)3 -0.05886 0.94284 0.40900 -0.144 0.886
ps(var2)4 -0.10752 0.89806 0.60868 -0.177 0.860
ps(var2)5 -0.16100 0.85129 0.65381 -0.246 0.805
ps(var2)6 -0.23156 0.79330 0.63652 -0.364 0.716
ps(var2)7 -0.26708 0.76561 0.61080 -0.437 0.662
ps(var2)8 -0.23270 0.79239 0.59903 -0.388 0.698
ps(var2)9 -0.23075 0.79394 0.59781 -0.386 0.700
ps(var2)10 -0.27852 0.75690 0.60117 -0.463 0.643
ps(var2)11 -0.26878 0.76431 0.64828 -0.415 0.678
ps(var2)12 -0.24330 0.78404 0.84486 -0.288 0.773
Likelihood ratio test=0.92 on 5.04 df, p=0.9702
n= 300, number of events= 200
As a bonus you get the LLR test value and a p-value. If you just want the matrix of the sort typically returned by a summary function, then make the obvious mods to the section of code in print.coxph
:
{ coef <- model$coefficients
se <- sqrt(diag(model$var))
if (is.null(coef) | is.null(se))
stop("Input is not valid")
if (is.null(model$naive.var)) {
tmp <- cbind(coef, exp(coef), se, coef/se, pchisq((coef/se)^2,
1, lower.tail = FALSE))
dimnames(tmp) <- list(names(coef), c("coef", "exp(coef)",
"se(coef)", "z", "p"))} }
And then tmp will be the desired matrix:
tmp
coef exp(coef) se(coef) z p
var1Low -0.02062820 0.9795831 0.2301878 -0.08961464 0.9285935
var1Medium -0.02171186 0.9785221 0.2383051 -0.09110952 0.9274056
ps(var2)3 -0.05886243 0.9428365 0.4089999 -0.14391797 0.8855652
ps(var2)4 -0.10752217 0.8980566 0.6086828 -0.17664728 0.8597855
ps(var2)5 -0.16099704 0.8512946 0.6538079 -0.24624519 0.8054924
ps(var2)6 -0.23155828 0.7932965 0.6365191 -0.36378845 0.7160160
ps(var2)7 -0.26708193 0.7656103 0.6108000 -0.43726573 0.6619186
ps(var2)8 -0.23269795 0.7923929 0.5990301 -0.38845785 0.6976772
ps(var2)9 -0.23074825 0.7939393 0.5978122 -0.38598783 0.6995057
ps(var2)10 -0.27852015 0.7569030 0.6011671 -0.46329903 0.6431500
ps(var2)11 -0.26877900 0.7643121 0.6482824 -0.41460174 0.6784335
ps(var2)12 -0.24329853 0.7840374 0.8448572 -0.28797593 0.7733652
I'm not sure a bug report is warranted but if you think otherwise, then Thomas Lumley is the clogit
author. The help page for clogit
does not describe a summary method and it appears that the print.clogit
, with dispatching to print.coxph
method was being used for the purposes typically assigned to summary
.
In addition, the coefficients themselves could have been obtained with model$coef
but that would not have returned the full matrix of coefficients pus ancillary statistical estimates,