pythonregressionstatsmodelspatsy

convert cost function to statsmodels formula


I want to fit some data to a curve, using this as a cost function:

def cost_func(x):
    return ((unknown_conc-x[1]*(x[0]*conc_A+
           (1-x[0])*conc_B))**2).sum()

It works when using scipy.optimize, but I want to use statsmodels instead. However I'm struggling with defining a statsmodels formula. Do you have any ideas how to do this?

I tried something like this, but it does not work with this x*A + (1-x)*B:

result = sm.ols(formula="A ~ I(B + C) -1", data=df).fit()

Solution

  • Statsmodels/patsy formulas are a language for writing linear models, so you need to find a way to phrase your problem as a formula where the predicted value is linear function of the parameters you want to fit.

    In this case, you're doing least-squares fitting where the prediction is (using Python syntax):

    x[1]*(x[0]*conc_A + (1 - x[0])*conc_B)
    

    Expanding the terms, we get:

    x[1]*x[0]*conc_A + x[1]*(1 - x[0])*conc_B
    

    Let's define new parameters param0 = x[1]*x[0] and param1 = x[1]*(1 - x[0]). Now our prediction becomes

    param0*conc_A + param1*conc_B
    

    Notice that these are invertible, i.e. these equalities hold:

    x[0] = param0 / (param0 + param1)
    x[1] = param0 + param1
    

    so this reparametrization isn't changing the underlying model we're fitting, it's just changing how we represent it. But the new representation is linear in the parameters, so now we can convert it to a statsmodels/patsy formula:

    "conc_A + conc_B - 1"
    

    And finally, let's put the value we're fitting our predictions against in the formula, giving:

    result = sm.ols("unknown_conc ~ conc_A + conc_B - 1", data=df).fit()
    

    If you fit this you'll get values for param0 and param1, and if you use the equations above you can convert these back to x[0] and x[1] values to compare to what you were getting before.