Tags: python, pandas, regression, statsmodels, patsy

Adding Regularization of Coefficients to Statsmodels (or Patsy)


Given that I have the following patsy formula,

'y ~ a + b + c'

and pass it to statsmodels.ols, how can I add a regularization term to the regression coefficients?

In this case, I wish to create my own penalisation function, not simply use ridge, lasso or elasticnet regression.

Here is a reproducible example with similar data to my problem:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

np.random.seed(0)  # fix the seed so the example is actually reproducible
a = np.clip(np.random.normal(loc=60, scale=40, size=100), 0, 100)
b = np.clip(np.random.normal(loc=40, scale=40, size=100), 0, 100)
c = np.clip(np.random.normal(loc=20, scale=20, size=100), 0, 100)

y = (
    32 * (a + 8 * np.random.random(a.shape))
    + 21 * (b + 5 * np.random.random(b.shape))
    + 36 * (c + 5 * np.random.random(c.shape))
) + (50 * np.random.random(a.shape))

data = pd.DataFrame(
    data=np.array([a, b, c, y]).T,
    columns=['a', 'b', 'c', 'y']
)

formula = 'y ~ a + b + c'

mod = smf.ols(formula=formula, data=data)

Solution

  • OLS computes its parameter estimates with a closed-form linear-algebra solution, so it cannot directly handle extensions that require nonlinear optimization.

    GLM with a Gaussian family is equivalent to OLS, but it (optionally) uses nonlinear optimization to find the maximum likelihood parameter estimates. Some extensions to OLS are therefore easier to implement in GLM.

    statsmodels has a generic penalization setup that can be added to any model whose fit is based on nonlinear optimization, such as the optimizers in scipy. It is experimental and not well advertised, but it forms the setup for internal reuse, for example in generalized additive models (GAM). We can combine an existing model with a provided penalization class; the optimization assumes that the penalized log-likelihood function is smooth or smoothed, for example a smoothed SCAD penalty https://github.com/statsmodels/statsmodels/blob/master/statsmodels/base/_penalties.py#L314
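Since the question asks for a custom penalisation function, the natural route is to subclass the Penalty base class from that same module and implement the penalty value and its derivatives. This is only a sketch, assuming the func/deriv/deriv2 interface used by the built-in penalties; the class name QuadraticPenalty and its alpha parameter are my own, not part of statsmodels:

```python
import numpy as np
from statsmodels.base._penalties import Penalty

class QuadraticPenalty(Penalty):
    """Hypothetical custom penalty: alpha * sum(weights * params**2)."""

    def __init__(self, alpha=1.0, weights=1.0):
        super().__init__(weights)
        self.alpha = alpha

    def func(self, params):
        # penalty value added to the (negative) log-likelihood
        return self.alpha * np.sum(self.weights * params**2)

    def deriv(self, params):
        # gradient of the penalty, used by the score function
        return 2.0 * self.alpha * self.weights * params

    def deriv2(self, params):
        # diagonal of the Hessian of the penalty
        return 2.0 * self.alpha * self.weights * np.ones(len(params))

pen = QuadraticPenalty(alpha=0.5)
print(pen.func(np.array([1.0, 2.0])))  # 0.5 * (1 + 4) = 2.5
```

An instance of such a class can then be passed wherever a penalty object is expected, in the same way as the built-in SCADSmoothed below.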

    Experimental means that it is tested for several cases, but it might not work with every model it could be combined with, or might need changes to work correctly for those models. Also, some API decisions and options are still open questions and subject to change.

    e.g. to define GLMPenalized we can just subclass PenalizedMixin and GLM, and provide a family and a penalty instance when creating the model:

    from statsmodels.base._penalized import PenalizedMixin
    from statsmodels.genmod.generalized_linear_model import GLM
    from statsmodels.genmod.families import family
    import statsmodels.base._penalties as smpen

    class GLMPenalized(PenalizedMixin, GLM):
        pass

    # smoothed SCAD penalty; y and x are the response and design matrix
    penalty = smpen.SCADSmoothed(0.1, c0=0.0001)
    mod = GLMPenalized(y, x, family=family.Gaussian(), penal=penalty)
    

    https://github.com/statsmodels/statsmodels/blob/master/statsmodels/base/tests/test_penalized.py#L34
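Putting the pieces together for the question's formula: patsy can build the response and design matrix from 'y ~ a + b + c', which then feed GLMPenalized. This is a sketch under the assumptions above; fitting with method='bfgs' follows statsmodels' own test file, and the data-generating code is a simplified version of the question's:

```python
import numpy as np
import pandas as pd
import patsy
import statsmodels.base._penalties as smpen
from statsmodels.base._penalized import PenalizedMixin
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.families import family

class GLMPenalized(PenalizedMixin, GLM):
    pass

# simplified version of the question's data
np.random.seed(0)
a = np.clip(np.random.normal(loc=60, scale=40, size=100), 0, 100)
b = np.clip(np.random.normal(loc=40, scale=40, size=100), 0, 100)
c = np.clip(np.random.normal(loc=20, scale=20, size=100), 0, 100)
y = 32 * a + 21 * b + 36 * c + 50 * np.random.random(100)
data = pd.DataFrame({'a': a, 'b': b, 'c': c, 'y': y})

# patsy turns the formula into a response vector and a design matrix
y_arr, x_arr = patsy.dmatrices('y ~ a + b + c', data, return_type='dataframe')

penalty = smpen.SCADSmoothed(0.1, c0=0.0001)
mod = GLMPenalized(y_arr, x_arr, family=family.Gaussian(), penal=penalty)
res = mod.fit(method='bfgs', maxiter=100)
print(res.params)  # Intercept, a, b, c
```

Swapping SCADSmoothed for a custom Penalty subclass is the point where your own penalisation function plugs in.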

    links:

      • https://github.com/statsmodels/statsmodels/pull/4576 PR that added PenalizedMixin
      • https://github.com/statsmodels/statsmodels/pull/4683 PR that uses it for ultra-high screening
      • https://github.com/statsmodels/statsmodels/pull/5481 PR adding GAM