pythonrpy2

How to extract glm() results from R objects using Pythons rpy2?


I use Python rpy2 to run glm() (regression model) in R. My problem is to get a result back that I can handle. The best would be a nice data structure (dict, dataframe, ...) but I also would be satisfied with a human readable multi-line string.

Using glm() directly in R produces outputs like, and that is what I expect:

Call:
glm(formula = C ~ B + A, family = poisson(), data = df, offset = D)

Coefficients:
              Estimate Std. Error   z value Pr(>|z|)
(Intercept) -2.391e+01  3.171e-03 -7540.871   <2e-16 ***
B           -2.372e-04  1.331e-04    -1.782   0.0748 .
A            1.285e-03  1.345e-04     9.552   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 17500269  on 49999  degrees of freedom
Residual deviance: 17500174  on 49997  degrees of freedom
AIC: Inf

Number of Fisher Scoring iterations: 16

In the example code below I try two variants of extracting the result.

In Variant A I do use capture.output() on the R side to extract the console output as a string. But the results looks akward (see doc string in code). In Variant B I use the summary() function via rpy2. Here I get a wired data structure I am not able to work with.

#!/usr/bin/env python3
import random
import pandas as pd
import rpy2
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.pandas2ri as pandas2ri
pandas2ri.activate()

random.seed(0)
k = 10000

df = pd.DataFrame({
    'A': random.choices(range(100), k=k),
    'B': random.choices([1, 2, 3], k=k),
    'C': random.choices([0, 1], k=k),
    'D': random.choices(range(20, 30), k=k),
})

glm = robjects.r['glm']

reg = glm(
    formula='C ~ B',
    data=robjects.conversion.py2rpy(df),
    family=robjects.r['poisson'](),
    offset=df['D']
)    

robjects.r('''get_result <- function(reg) {
            result <- paste(capture.output(reg), collapse="\n")
            return(result)
        }
''')
get_result = robjects.globalenv['get_result']

result_variant_A = get_result(reg)
"""
[...SNIPPED...]
L, `9948` = 23L, `9949` = 23L, `9950` = 22L, `9951` = 22L, \n`9952` = 29L, `9953` = 27L, `9954` = 26L, `9955` = 23L, `9956` = 28L, \n`9957` = 23L, `9958` = 22L, `9959` = 26L, `9960` = 21L, `9961` = 24L, \n`9962` = 28L, `9963` = 22L, `9964` = 24L, `9965` = 26L, `9966` = 29L, \n`9967` = 20L, `9968` = 25L, `9969` = 22L, `9970` = 20L, `9971` = 22L, \n`9972` = 28L, `9973` = 29L, `9974` = 24L, `9975` = 21L, `9976` = 28L, \n`9977` = 22L, `9978` = 27L, `9979` = 26L, `9980` = 23L, `9981` = 22L, \n`9982` = 22L, `9983` = 21L, `9984` = 22L, `9985` = 28L, `9986` = 23L, \n`9987` = 23L, `9988` = 21L, `9989` = 28L, `9990` = 28L, `9991` = 20L, \n`9992` = 20L, `9993` = 20L, `9994` = 24L, `9995` = 22L, `9996` = 24L, \n`9997` = 20L, `9998` = 20L, `9999` = 22L))\n\nCoefficients:\n(Intercept)            B  \n  -27.79929     -0.01377  \n\nDegrees of Freedom: 9999 Total (i.e. Null);  9998 Residual\nNull Deviance:\t    33360 \nResidual Deviance: 33360 \tAIC: 43370'],
      dtype='<U419704')
>>> type(result)
<class 'numpy.ndarray'>
"""

result_variant_B = robjects.r['summary'](reg)
"""
<rpy2.robjects.vectors.ListVector object at 0x000002C9B71F6A10> [19]
R classes: ('summary.glm',)
[LangSexpV..., LangSexpV..., ListSexpV..., FloatSexp..., ..., FloatSexp..., IntSexpVe..., FloatSexp..., FloatSexp...]
  call: <class 'rpy2.robjects.language.LangVector'>
  Rlang( (function (formula, family = gaussian, data, weights, subset,  )
  terms: <class 'rpy2.robjects.Formula'>
  <rpy2.robjects.Formula object at 0x000002C9B2630890> [6]
R classes: ('terms', 'formula')
<rpy2.robjects.vectors.ListVector object at 0x000002C9B71F6A10> [19]
R classes: ('summary.glm',)
[LangSexpV..., LangSexpV..., ListSexpV..., FloatSexp..., ..., FloatSexp..., IntSexpVe..., FloatSexp..., FloatSexp...]
  deviance: <class 'numpy.ndarray'>
  array([33357.71065632])
...
  contrasts: <class 'numpy.ndarray'>
  array([1.])
  df.residual: <class 'rpy2.robjects.vectors.IntVector'>
  <rpy2.robjects.vectors.IntVector object at 0x000002C9B2630890> [13]
R classes: ('integer',)
[2, 9998, 2]
  null.deviance: <class 'numpy.ndarray'>
  array([[ 0.00138301, -0.00059514],
       [-0.00059514,  0.00029936]])
  df.null: <class 'numpy.ndarray'>
  array([[ 0.00138301, -0.00059514],
       [-0.00059514,  0.00029936]])
"""

Solution

  • You can use rx or rx2 to access elements within your R objects. For example, to access the coefficients of a GLM model, you can use the following:

    reg.rx2('coefficients')
    

    In general:

    For more details, refer to the rpy2 documentation on linear models.

    Alternatively, assuming you have the broom package installed, you can use the broom::tidy package in R to convert the results of your model into a tidy R data frame, which can then be converted to a pandas DataFrame for easier access in Python. Here's how:

    # Convert model results using broom::tidy
    tidy = robjects.r('broom::tidy')
    tidy_reg = tidy(reg)
    
    # Convert the R dataframe to a pandas dataframe
    glm_result = pandas2ri.rpy2py(tidy_reg)
    
    # Accessing elements in the pandas DataFrame
    print(glm_result)
    print(glm_result.at['2', 'estimate'])  # Note: Index values are strings.
    

    The resulting pandas DataFrame will have string-based indices. You can reset the index if needed and use Python's standard indexing methods. Below is the full code:

    import random
    import pandas as pd
    import rpy2.robjects as robjects
    import rpy2.robjects.pandas2ri as pandas2ri
    pandas2ri.activate()
    
    random.seed(0)
    k = 10000
    
    df = pd.DataFrame({
        'A': random.choices(range(100), k=k),
        'B': random.choices([1, 2, 3], k=k),
        'C': random.choices([0, 1], k=k),
        'D': random.choices(range(20, 30), k=k),
    })
    
    glm = robjects.r['glm']
    
    reg = glm(
        formula='C ~ B',
        data=robjects.conversion.py2rpy(df),
        family=robjects.r['poisson'](),
        offset=df['D'])    
    
    print(robjects.r("summary")(reg))
    print(reg.rx2('coefficients'))
    tidy = robjects.r('broom::tidy')
    tidy_reg = tidy(reg)
    
    glm_result = pandas2ri.rpy2py(tidy_reg)
    print(glm_result)
    print(glm_result.at['2','estimate'])