rstatalogistic-regression

Replicating logistic regression results from Stata in R


I'm facing an issue with replicating the results of a basic logistic regression model in R. The model was originally run in Stata.

The issue is that all of my odds ratios (and their underlying coefficients) are the same, except for one.

Stata code:

logistic sfcancer2yr c.age i.bminewcat i.n_rels i.race_eth i.meno i.bdnewcat i.biopsycoll i.birads i.bdnewcat#i.meno

R code:

trainmodel <- glm(sfcancer2yr ~ age + factor(bminewcat) + n_rels + factor(race_eth) + factor(meno) + bdnewcat + biopsycoll + birads + bdnewcat:meno, family = "binomial"(link = "logit"), data = traindata)

coef.orig <- coef(trainmodel)

exp(trainmodel$coefficients[-1])

When reviewing the results, it looks like only the OR for the variable "meno" is different.

Stata results:


Logistic regression                                     Number of obs = 68,605
                                                        LR chi2(13)   = 402.74
                                                        Prob > chi2   = 0.0000
Log likelihood = -6067.6077                             Pseudo R2     = 0.0321

-------------------------------------------------------------------------------
  sfcancer2yr | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
          age |   1.025027   .0033988     7.45   0.000     1.018388    1.031711
              |
    bminewcat |
           2  |   1.383954    .095462     4.71   0.000     1.208948    1.584294
           3  |   1.385149   .1084733     4.16   0.000     1.188058    1.614937
              |
     1.n_rels |   1.580755    .111435     6.50   0.000     1.376763    1.814971
              |
     race_eth |
           2  |   .7453006   .1084842    -2.02   0.043     .5603155    .9913575
           3  |   .4592529    .087328    -4.09   0.000     .3163691    .6666683
           4  |   .8301742   .1199594    -1.29   0.198     .6254195    1.101963
           9  |   .7560139   .1817387    -1.16   0.245     .4719641    1.211018
              |
       1.meno |   1.680891    .253618     3.44   0.001     1.250569    2.259287
   2.bdnewcat |   2.710396   .3969418     6.81   0.000     2.034102    3.611541
 1.biopsycoll |   1.632317   .1289949     6.20   0.000     1.398099    1.905774
     2.birads |   1.469719   .0995559     5.68   0.000     1.286991     1.67839
              |
bdnewcat#meno |
         2 1  |   .5937504   .0942284    -3.28   0.001     .4350287    .8103821
              |
        _cons |     .00155   .0003223   -31.12   0.000     .0010312    .0023298
-------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

R results

tidy(trainmodel, conf.int = TRUE, conf.level = 0.95, exponentiate = TRUE)

# A tibble: 14 × 7
   term               estimate std.error statistic   p.value conf.low conf.high
   <chr>                 <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
 1 (Intercept)        0.000389   0.325      -24.2  3.29e-129 0.000203  0.000726
 2 age                1.03       0.00332      7.46 8.98e- 14 1.02      1.03    
 3 factor(bminewcat)2 1.38       0.0690       4.71 2.47e-  6 1.21      1.58    
 4 factor(bminewcat)3 1.39       0.0783       4.16 3.18e-  5 1.19      1.61    
 5 n_rels             1.58       0.0705       6.50 8.27e- 11 1.37      1.81    
 6 factor(race_eth)2  0.745      0.146       -2.02 4.34e-  2 0.554     0.980   
 7 factor(race_eth)3  0.459      0.190       -4.09 4.26e-  5 0.309     0.654   
 8 factor(race_eth)4  0.830      0.144       -1.29 1.98e-  1 0.618     1.09    
 9 factor(race_eth)9  0.756      0.240       -1.16 2.45e-  1 0.455     1.17    
10 factor(meno)1      2.83       0.294        3.54 3.93e-  4 1.62      5.12    
11 bdnewcat           2.71       0.146        6.81 9.79e- 12 2.05      3.64    
12 biopsycoll         1.63       0.0790       6.20 5.63e- 10 1.40      1.90    
13 birads             1.47       0.0677       5.68 1.31e-  8 1.29      1.68    
14 bdnewcat:meno      0.594      0.159       -3.29 1.02e-  3 0.432     0.805   

As you can see, the OR for 'meno' increases in R. I'm stumped as to why. I'd really like to figure out what's causing this and what's the correct set of results to use.

My data is more than 70,000 observations; I am not familiar with how to provide a data example but if someone could provide guidance on this as well I would be grateful.

Things I've already tried:


Solution

  • In Stata, variables meno and bdnewcat were treated as factor (categorical) variables in main effect and the interaction terms. In R, you treated the variable bdnewcat as a numeric (or continuous) variable in main effect and interaction terms. It is likely that the variable bdnewcat is currently defined as numeric in R instead of factor in the data frame.

    I would try to explicitly define bdnewcat and meno as factor in logistic model:

    trainmodel <- glm(
      sfcancer2yr ~ age + factor(bminewcat) + n_rels + factor(race_eth) + 
        factor(meno) + factor(bdnewcat) + biopsycoll + birads + 
        factor(bdnewcat):factor(meno),
      family = "binomial"(link = "logit"),
      data = traindata
    )
    

    Please check if it makes any difference as I can't check it without the data.