pythonstatisticsregressionpoissonpolynomial-approximations

Polynomial Regression in Poisson Regression


I want to create a model which takes values from a polynomial regression and creates a Poisson Regression based upon the predicted values (by polynomial).

I only got the R code which would be something like this

glm(y ~ poly(x, 6), family = Poisson, data = data_set)

My approach so far is first calculate the Polynomial predictions for the discrete data points first and based upon this run a Poisson Regression. However, my results are a bit off.

import pandas as pd
from statsmodels.formula.api import glm
polynomial_model = np.poly1d(np.polyfit(x=bin_mids, y=count_per_bin, deg = 6))
model_values = [polynomial_model(i) for i in bin_mids]
df_1["model_values"] = model_values
poisson_model = glm("df_1['count_per_bin'] ~ df_1['model_values']" , data = df_1 ,family = sm.families.Poisson()).fit()

If anyone of you see my mistake I'd love to understand where I went wrong.

Cheers


Solution

  • Not very sure what you are doing with binning and rest of the code. What you are doing in R is a regression with orthogonal polynomials. You can see this for more info about why it's used.

    If you are using statsmodels, you just need to provide the input matrix with transformed values of x, x^2 , x^3 then perform a QR decomposition, as also touch upon in this post. you can use sklearn to get the matrix:

    import numpy as np
    import statsmodels.api as sm
    import pandas as pd
    from sklearn.preprocessing import PolynomialFeatures
    
    np.random.seed(111)
    df = pd.DataFrame({'x':np.random.uniform(0,1,50),'y':np.random.poisson(5,50)})
    
    xp = PolynomialFeatures(degree=6).fit_transform(df[['x']])
    xp = np.linalg.qr(xp)[0][:,1:]
    
    model = sm.GLM(df['y'],sm.add_constant(xp),family=sm.families.Poisson()).fit()
    model.summary()
    
                         Generalized Linear Model Regression Results                  
    ==============================================================================
    Dep. Variable:                      y   No. Observations:                   50
    Model:                            GLM   Df Residuals:                       43
    Model Family:                 Poisson   Df Model:                            6
    Link Function:                    log   Scale:                          1.0000
    Method:                          IRLS   Log-Likelihood:                -100.45
    Date:                Thu, 04 Mar 2021   Deviance:                       33.846
    Time:                        22:32:57   Pearson chi2:                     31.8
    No. Iterations:                     4                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const          1.5458      0.066     23.495      0.000       1.417       1.675
    x1            -0.2059      0.439     -0.469      0.639      -1.067       0.655
    x2             0.8169      0.466      1.754      0.079      -0.096       1.730
    x3             0.1178      0.442      0.267      0.790      -0.748       0.984
    x4            -0.5503      0.454     -1.212      0.226      -1.440       0.340
    x5             0.0035      0.457      0.008      0.994      -0.892       0.899
    x6            -0.6878      0.455     -1.512      0.131      -1.579       0.204
    ==============================================================================
    

    We write the data to csv

    df.to_csv("data.csv")
    

    and fit using R, you can see we get the same coefficients:

    df = read.csv("data.csv",row.names=1)
    summary(glm(y ~ poly(x, 6), family = poisson, data = df))
    
    Call:
    glm(formula = y ~ poly(x, 6), family = poisson, data = df)
    
    Deviance Residuals: 
         Min        1Q    Median        3Q       Max  
    -2.42648  -0.73970  -0.04625   0.61351   1.81234  
    
    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  1.545808   0.065794  23.495   <2e-16 ***
    poly(x, 6)1 -0.205940   0.439457  -0.469   0.6393    
    poly(x, 6)2  0.816910   0.465770   1.754   0.0794 .  
    poly(x, 6)3  0.117815   0.441847   0.267   0.7897    
    poly(x, 6)4 -0.550271   0.454099  -1.212   0.2256    
    poly(x, 6)5 -0.003508   0.456691  -0.008   0.9939    
    poly(x, 6)6  0.687751   0.454953   1.512   0.1306    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for poisson family taken to be 1)
    
        Null deviance: 40.443  on 49  degrees of freedom
    Residual deviance: 33.846  on 43  degrees of freedom
    AIC: 214.91