pythonpandasscipypatsy

Replicate Scipy's RegressionResults.predict functionality


Here's my sample program:

import numpy as np
import pandas as pd
import statsmodels
from statsmodels.formula.api import ols
df = pd.DataFrame({"z": [1,1,1,2,2,2,3,3,3],
                   "x":[0,1,2,0,1,2,0,1,2],
                   "y":[0,2,4,3,5,7,7,9,11]
                   })
model = ols("y ~ x + z + I(z**2)", df).fit()
model.params

newdf = pd.DataFrame({"z": [4,4,4,5,5,5],
                   "x":[0,1,2,0,1,2]
                   })
model.predict(newdf)

You'll notice, if you run this, that model.params is a pandas Series with indices the same as the right-hand side of the formula, except with an additional entry: "Intercept"

>  Out[2]: 
>     Intercept   -2.0
>     x            2.0
>     z            1.5
>     I(z ** 2)    0.5
>     dtype: float64

And, using some internal functionality I can't determine, the RegressionResults object's .predict() can recognize the column headers from newdf, match them up (including the patsy syntax "I(z**2)"), add the intercept, and return an answer Series. (this is the last line of my sample code)

This seems convenient! Better than writing out my formula again in python/numpy code whenever I want to evaluate slight variations on it. I feel like there should be some way for me to construct a similar pd.Series for formula coefficients, instead of having created it through a model and fit. Then I should be able to apply this to an appropriate dataframe as a way of evaluating functions.

My attempts to figure out how statsmodel is doing this haven't worked, I haven't found anything obvious in the related function doc pages, in patsy, nor can I seem to enter this section of the source code while debugging. Anyone have any idea how to set this up?


Solution

  • I eventually pieced together one way of doing this.

    def predict(coeffs,datadf:pd.DataFrame)->np.array:
        """Apply a series (or df) of coefficents indexed by model terms to new data
    
        :param coeffs: a series whose elements are coefficients and index are the formula terms
                    or a df whose column names are formula terms, and each row is a set of coefficients
        :param datadf: the new data to predict on
        """
        desc = patsy.ModelDesc([],[patsy.Term([]) if column=="Intercept" else patsy.Term([patsy.EvalFactor(column)]) for column in coeffs.index] )
    
        dmat = patsy.dmatrix(desc,datadf)
        return np.dot(dmat, coeffs.T)
    
    newdf["y"] = predict(model.params,newdf)
    

    The reason this seemed so appealing to me, in case anyone is baffled, is that I was fitting data piecewise using df.groupby("column").apply(FitFunction). It seemed like having FitFunction() return the model.params series would be the cleanest approach within the pandas paradigm.