Here's my sample program:
import numpy as np
import pandas as pd
import statsmodels
from statsmodels.formula.api import ols
df = pd.DataFrame({"z": [1,1,1,2,2,2,3,3,3],
"x":[0,1,2,0,1,2,0,1,2],
"y":[0,2,4,3,5,7,7,9,11]
})
model = ols("y ~ x + z + I(z**2)", df).fit()
model.params
newdf = pd.DataFrame({"z": [4,4,4,5,5,5],
"x":[0,1,2,0,1,2]
})
model.predict(newdf)
You'll notice, if you run this, that model.params is a pandas Series with indices the same as the right-hand side of the formula, except with an additional entry: "Intercept"
> Out[2]:
> Intercept -2.0
> x 2.0
> z 1.5
> I(z ** 2) 0.5
> dtype: float64
And, using some internal functionality I can't determine, the RegressionResults object's .predict() can recognize the column headers from newdf, match them up (including the patsy syntax "I(z**2)"), add the intercept, and return an answer Series. (this is the last line of my sample code)
This seems convenient! Better than writing out my formula again in python/numpy code whenever I want to evaluate slight variations on it. I feel like there should be some way for me to construct a similar pd.Series for formula coefficients, instead of having created it through a model and fit. Then I should be able to apply this to an appropriate dataframe as a way of evaluating functions.
My attempts to figure out how statsmodel is doing this haven't worked, I haven't found anything obvious in the related function doc pages, in patsy, nor can I seem to enter this section of the source code while debugging. Anyone have any idea how to set this up?
I eventually pieced together one way of doing this.
def predict(coeffs,datadf:pd.DataFrame)->np.array:
"""Apply a series (or df) of coefficents indexed by model terms to new data
:param coeffs: a series whose elements are coefficients and index are the formula terms
or a df whose column names are formula terms, and each row is a set of coefficients
:param datadf: the new data to predict on
"""
desc = patsy.ModelDesc([],[patsy.Term([]) if column=="Intercept" else patsy.Term([patsy.EvalFactor(column)]) for column in coeffs.index] )
dmat = patsy.dmatrix(desc,datadf)
return np.dot(dmat, coeffs.T)
newdf["y"] = predict(model.params,newdf)
The reason this seemed so appealing to me, in case anyone is baffled, is that I was fitting data piecewise using df.groupby("column").apply(FitFunction)
. It seemed like having FitFunction()
return the model.params
series would be the cleanest approach within the pandas paradigm.