I'm facing an issue with replicating the results of a basic logistic regression model in R. The model was originally run in Stata.
The issue is that all of my odds ratios (and their underlying coefficients) are the same, except for one.
Stata code:
logistic sfcancer2yr c.age i.bminewcat i.n_rels i.race_eth i.meno i.bdnewcat i.biopsycoll i.birads i.bdnewcat#i.meno
R code:
trainmodel <- glm(sfcancer2yr ~ age + factor(bminewcat) + n_rels + factor(race_eth) + factor(meno) + bdnewcat + biopsycoll + birads + bdnewcat:meno, family = "binomial"(link = "logit"), data = traindata)
coef.orig <- coef(trainmodel)
exp(trainmodel$coefficients[-1])
When reviewing the results, it looks like only the OR for the variable "meno" is different.
Stata results:
Logistic regression Number of obs = 68,605
LR chi2(13) = 402.74
Prob > chi2 = 0.0000
Log likelihood = -6067.6077 Pseudo R2 = 0.0321
-------------------------------------------------------------------------------
sfcancer2yr | Odds ratio Std. err. z P>|z| [95% conf. interval]
--------------+----------------------------------------------------------------
age | 1.025027 .0033988 7.45 0.000 1.018388 1.031711
|
bminewcat |
2 | 1.383954 .095462 4.71 0.000 1.208948 1.584294
3 | 1.385149 .1084733 4.16 0.000 1.188058 1.614937
|
1.n_rels | 1.580755 .111435 6.50 0.000 1.376763 1.814971
|
race_eth |
2 | .7453006 .1084842 -2.02 0.043 .5603155 .9913575
3 | .4592529 .087328 -4.09 0.000 .3163691 .6666683
4 | .8301742 .1199594 -1.29 0.198 .6254195 1.101963
9 | .7560139 .1817387 -1.16 0.245 .4719641 1.211018
|
1.meno | 1.680891 .253618 3.44 0.001 1.250569 2.259287
2.bdnewcat | 2.710396 .3969418 6.81 0.000 2.034102 3.611541
1.biopsycoll | 1.632317 .1289949 6.20 0.000 1.398099 1.905774
2.birads | 1.469719 .0995559 5.68 0.000 1.286991 1.67839
|
bdnewcat#meno |
2 1 | .5937504 .0942284 -3.28 0.001 .4350287 .8103821
|
_cons | .00155 .0003223 -31.12 0.000 .0010312 .0023298
-------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
R results
tidy(trainmodel, conf.int = TRUE, conf.level = 0.95, exponentiate = TRUE)
# A tibble: 14 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.000389 0.325 -24.2 3.29e-129 0.000203 0.000726
2 age 1.03 0.00332 7.46 8.98e- 14 1.02 1.03
3 factor(bminewcat)2 1.38 0.0690 4.71 2.47e- 6 1.21 1.58
4 factor(bminewcat)3 1.39 0.0783 4.16 3.18e- 5 1.19 1.61
5 n_rels 1.58 0.0705 6.50 8.27e- 11 1.37 1.81
6 factor(race_eth)2 0.745 0.146 -2.02 4.34e- 2 0.554 0.980
7 factor(race_eth)3 0.459 0.190 -4.09 4.26e- 5 0.309 0.654
8 factor(race_eth)4 0.830 0.144 -1.29 1.98e- 1 0.618 1.09
9 factor(race_eth)9 0.756 0.240 -1.16 2.45e- 1 0.455 1.17
10 factor(meno)1 2.83 0.294 3.54 3.93e- 4 1.62 5.12
11 bdnewcat 2.71 0.146 6.81 9.79e- 12 2.05 3.64
12 biopsycoll 1.63 0.0790 6.20 5.63e- 10 1.40 1.90
13 birads 1.47 0.0677 5.68 1.31e- 8 1.29 1.68
14 bdnewcat:meno 0.594 0.159 -3.29 1.02e- 3 0.432 0.805
As you can see, the OR for 'meno' increases in R. I'm stumped as to why. I'd really like to figure out what's causing this and what's the correct set of results to use.
My data is more than 70,000 observations; I am not familiar with how to provide a data example but if someone could provide guidance on this as well I would be grateful.
Things I've already tried:
In Stata, variables meno
and bdnewcat
were treated as factor (categorical) variables in main effect and the interaction terms. In R, you treated the variable bdnewcat
as a numeric (or continuous) variable in main effect and interaction terms. It is likely that the variable bdnewcat
is currently defined as numeric in R instead of factor in the data frame.
I would try to explicitly define bdnewcat
and meno
as factor in logistic model:
trainmodel <- glm(
sfcancer2yr ~ age + factor(bminewcat) + n_rels + factor(race_eth) +
factor(meno) + factor(bdnewcat) + biopsycoll + birads +
factor(bdnewcat):factor(meno),
family = "binomial"(link = "logit"),
data = traindata
)
Please check if it makes any difference as I can't check it without the data.