Dear StackOverflow community, as a surgeon, and full of enthusiasm for 6 months for R learning in self-taught mode (StackOverflow, and so many websites), I beg your indulgence in the triviality of my concern.
The background: Briefly, my objective is to run a survival cox model regression for a dataset of cancer patients. Due to the retrospective aspect, I planned to make a matching 1:3 with propensity score matching (PSM). The missing data were dealt with multiple imputations ("mice" pkg). The PSM was managed with "MatchThem" pkg. I used "survey" pkg for pooling the survival (svycoxph() pooled through with() function). This leads us to a mimira object, which I can easily print out into a beautiful Table, with tbl_regression ("gtsummary" pkg).
The issue: As a usually print my cox regressions into a Hazard ratios Table and a graphical version (Forest plot with ggforest(), from "survminer" pkg), this time I am really stuck. The function ggforest doesn't recognize the mimira object as a "coxph object" and send this error :
Error in ggforest(tbl_regression_object, data = mimira_object) :
inherits(model, "coxph") is not TRUE
I guess that adding a PSM to my multiple imputations is the problem, as I had no problem for printing cox regression of multiple imputations with Forest plot (ggforest is able to deal mira objects without problem with pool_and_tidy_mice() function).
Here is the script:
#Data
library(fabricatr)
library(simsurv)
# Simulate patient data in a clinical trial
participant_data <- fabricate(
N = 2000,
age = runif(N, min = 18, max = 85),
is_female = draw_binary(prob = 0.5, N = N),
is_smoker = draw_binary(prob = 0.2 + 0.2 * (age > 50), N = N),
disease_stage = round(runif(N, min = 1 + 0.5 * (age > 65), max = 4)),
treatment = draw_binary(prob = 0.5, N = N),
kps = runif(N, min = 40, max = 100)
)
# Simulate data in the survival context
survival_data <- simsurv(
lambdas = 0.1, gammas = 1.8,
x = participant_data,
betas = c(is_female = -0.2, is_smoker = 1.2,
treatment = -0.4, kps = -0.005,
disease_stage = 0.2),
maxt = 5)
# Merging df
library(dplyr)
mydata_complete <- bind_cols(survival_data, participant_data)
# generating missing value
library(missMethods)
mydata_uncomp <- delete_MCAR(mydata_complete, 0.3)
mydata <- mydata_uncomp
#1 imputation with "mice"
library(mice)
mydata$nelsonaalen <- nelsonaalen(mydata, eventtime, status)
mydata_mice_imp_m3 <- mice(mydata, maxit = 2, m = 3, seed = 20200801) # m=3 is for testing
#2 matching (PSM 1:3) with "MatchThem"
library(MatchThem)
mydata_imp_m3_psm <- matchthem(treatment ~ age + is_female + disease_stage, data = mydata_mice_imp_m3, approach = "within" ,ratio= 1, method = "optimal")
#3 Pooling Coxph models in multiple imputed datasets and PSM with "survey"
library(survey)
mimira_object <- with(data = mydata_imp_m3_psm, expr = svycoxph(Surv(eventtime, status) ~ age+ is_smoker + disease_stage))
pool_and_tidy_mice(mimira_object, exponentiate = TRUE, conf.int=TRUE) -> pooled_imp_m3_cph
# estimates with pool_and_tidy_mice() works with mimira_object but cannot bring me de degree of freedoms. Warning message :
In get.dfcom(object, dfcom) : Infinite sample size assumed.
> pooled_imp_m3_cph
term estimate std.error statistic p.value conf.low conf.high b df dfcom fmi lambda m riv ubar
1 age 0.9995807 0.001961343 -0.2138208 NaN NaN NaN 1.489769e-06 NaN Inf NaN 0.5163574 3 1.067643 1.860509e-06
2 is_smoker 2.8626952 0.093476026 11.2516931 NaN NaN NaN 4.182884e-03 NaN Inf NaN 0.6382842 3 1.764601 3.160589e-03
3 disease_stage 1.2386947 0.044092483 4.8547535 NaN NaN NaN 8.995628e-04 NaN Inf NaN 0.6169374 3 1.610540 7.447299e-04
#4 Table summary of the pooled results
library(gtsummary)
tbl_regression_object <- tbl_regression(mimira_object, exp=TRUE, conf.int = TRUE) # 95% CI and p-value are missing due to an issue with an other issue in the pooling of the mimira_object. The Matchthem:::get.2dfcom function gives a dfcom = 999999 (another issue to be solved in my concern)
#5 What it should looks like as graphical summary
library(survival)
mydata.cox <- coxph(Surv(eventtime, status) ~ age+ is_smoker + disease_stage, mydata_uncomp) # (df mydata_uncomp is without imputation and PSM)
#with gtsummary
forestGT <-
mydata.cox %>%
tbl_regression(exponentiate = TRUE,
add_estimate_to_reference_rows = TRUE) %>%
plot()
(forestGT) # See picture GT_plot1. Almost perfect. Would have been great to know how to add N, 95% CI, HR, p-value and parameters of the model (AIC, events, concordance, etc.)
#with survminer
HRforest <-
survminer::ggforest(mydata.cox, data = mydata_uncomp)
(HRforest) # See picture Ggforest. Everything I need to know about my cox regression is all in there. For me it is just a great regression cox forest plot.
#6 Actually what happens when I do the same thing with imputed and matched df
#with gtsummary
forestGT_imp_psm <-
mimira_object %>%
tbl_regression(exponentiate = TRUE,
add_estimate_to_reference_rows = TRUE) %>%
plot() # WARNING message : In get.dfcom(object, dfcom) : Infinite sample size assumed.
(forestGT_imp_psm) # See picture GT_plot2. The plot is rendered but without 95% IC
#with survminer
HRforest_imp_psm <-
ggforest(mimira_object, data = mydata_imp_m3_psm) # ERROR:in ggforest(mimira_object, data = mydata_imp_m3_psm) : inherits(model, "coxph") is not TRUE
(HRforest_imp_psm)
#7 The lucky and providential step
# your solution/advise
Would greatly appreciate your help.
cheers.
AK
Picture GT_plot1 (not allowed to embed images in this post, here is sharelink : GT_plot1
Picture Ggforest_plot Ggforest_plot
Picture GT_plot2 GT_plot2
It seems that there are two distinct problems here:
Problem #1. getting gtsummary()
to produce a table with p values and confidence intervals of the pooled, matched data
Problem #2. producing a ggforest()
to produce a plot of the pooled estimates.
Problem #1:
Let us follow the instructions in the paper "MatchThem:: Matching and Weighting after Multiple Imputation" (https://arxiv.org/ftp/arxiv/papers/2009/2009.11772.pdf) [page 15]
and modify your block #3. Instead of calling pool_and_tidy_mice()
we do the following:
matched.results <- pool(mimira_object)
summary(matched.results, conf.int = TRUE)
This produces the following:
term estimate std.error statistic df p.value 2.5 % 97.5 %
1 age -0.0005997864 0.001448251 -0.4141453 55.266353 6.803707e-01 -0.003501832 0.00230226
2 is_smoker 1.1157796620 0.077943244 14.3152839 9.961064 5.713387e-08 0.942019234 1.28954009
3 disease_stage 0.2360965310 0.051799813 4.5578645 3.879879 1.111782e-02 0.090504018 0.38168904
This means that performing the imputation with mice
and then matching with MatchThem
works, since you do get the p values and the confidence intervals.
Compare to the output from pool_and_tidy_mice()
:
term estimate std.error statistic p.value b df dfcom fmi lambda m
1 age -0.0005997864 0.001448251 -0.4141453 NaN 2.992395e-07 NaN Inf NaN 0.1902260 3
2 is_smoker 1.1157796620 0.077943244 14.3152839 NaN 2.041627e-03 NaN Inf NaN 0.4480827 3
3 disease_stage 0.2360965310 0.051799813 4.5578645 NaN 1.444843e-03 NaN Inf NaN 0.7179644 3
riv ubar
1 0.2349124 1.698446e-06
2 0.8118657 3.352980e-03
3 2.5456522 7.567636e-04
Where everything is the same except for df and p.value which were not calculated in the latter table.
I therefore think this is an issue with the pool_and_tidy_mice()
and you should post this as an issue on GitHub at gtsummary.
For right now, you can bypass this problem by changing svycoxph()
to survival::coxph()
in block #3 when you call the with()
function. If you do that, then eventually you will get a gtsummary table with p.values and confidence intervals. Ultimately, the problem is probably some interaction between svycoxph()
and pool_and_mice()
, hence why I believe that you should post this on GitHub.
Problem #2:
The short answer is that there cannot be a ggforest plot with all the data that you are looking for.
https://www.rdocumentation.org/packages/mice/versions/3.13.0/topics/pool reads:
A common error is to reverse steps 2 and 3, i.e., to pool the multiply-imputed data instead of the estimates. Doing so may severely bias the estimates of scientific interest and yield incorrect statistical intervals and p-values. The pool() function will detect this case.
This means that there is no "real" dataset for the pooled estimates (i.e. you cannot really combine the datasets for imputations 1-3), which means that ggforest()
cannot compute the desired plot (since it needs to have a dataset and that cannot be used because it would lead to erroneous estimates).
What you could do, is present all the ggforest plots for each imputation (so if you did 3 imputations, you will get 3 slightly different ggforest plots) and finally add the pooled estimates plot by using plot()
as suggested above.
To create each ggforest plot you need the following line of code:
ggforest(mimira_object$analyses[[1]], complete(mydata_imp_m3_psm, 1))
This will create the ggforest plot for your first imputation. Change the numbers to 2 and 3 to check the remaining imputations.
I hope this helped,
Alex