python statsmodels patsy

How to modify a linear regression in Python 3.6?


The code looks like:

import pandas as pd
import statsmodels.formula.api as smf

# Load the data and fit an OLS model with full interactions
df = pd.read_csv('reg_data.csv')
f = 'inf ~ rh*temp*tl*Tt*C(location)'
lm = smf.ols(formula=f, data=df).fit()

But it always gives me an error:

numbers besides '0' and '1' are only allowed with **

The values in the file are all ordinary numbers; some have 2 decimal places, some have more. Any idea how to solve this problem and get the regression summary (via lm.summary())?

Thank you in advance!


Solution

  • Oh, you found an interesting bug.

    First, the error message isn't talking about the numbers in your data. It is raised when you type a literal number into your formula: for example, "y ~ 3*x" raises that error because the parser doesn't like the 3.

    But your formula doesn't have any numbers in it, so what's going on? Well, you're hitting a bug in the formula parser: it decides whether a token is a number by checking whether it can be passed to int(...) or float(...) and get a value back. But in Python, float("inf") is a valid expression that returns the floating-point value representing infinity, even though a bare inf isn't a number in Python. So the parser mistakes your column name inf for a numeric literal.
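To see why the name inf gets caught, here is a minimal sketch of the kind of check described above. The function looks_like_number is a hypothetical stand-in written for illustration, not patsy's actual code; it only assumes the int(...)/float(...) test the answer describes.

```python
def looks_like_number(token):
    """Illustrative check: a token counts as numeric if int() or float() accepts it.

    This is a hypothetical stand-in for the parser's test, not patsy's real code.
    """
    for convert in (int, float):
        try:
            convert(token)
            return True
        except ValueError:
            pass  # this converter rejected the token; try the next one
    return False


# Column names from the question, plus "nan" for comparison
column_names = ["inf", "rh", "temp", "tl", "Tt", "nan"]
flagged = [name for name in column_names if looks_like_number(name)]
print(flagged)  # ['inf', 'nan']
```

Only inf and nan are flagged, because float() accepts both strings, while the other names raise ValueError.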

    I filed the bug here: https://github.com/pydata/patsy/issues/118

    And the workaround for now is to avoid using the string inf as the name for one of your columns. (You should probably avoid nan too, for the same reason.) Sorry about that!
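One way to apply the workaround is sketched below. The helper rename_numeric_like and the "_val" suffix are hypothetical choices made for this example; the column names are the ones from the question's formula, and the assumption is that any name float() accepts should be renamed.

```python
def rename_numeric_like(columns, suffix="_val"):
    """Return {old: new} for column names that float() would accept (e.g. 'inf', 'nan').

    Hypothetical helper for illustration; the suffix is an arbitrary choice.
    """
    mapping = {}
    for name in columns:
        try:
            float(name)
        except ValueError:
            continue  # ordinary name, leave it alone
        mapping[name] = name + suffix
    return mapping


# Column names taken from the formula in the question
columns = ["inf", "rh", "temp", "tl", "Tt", "location"]
mapping = rename_numeric_like(columns)

# Rebuild the formula with the renamed response variable
response = mapping.get("inf", "inf")
formula = "{} ~ rh*temp*tl*Tt*C(location)".format(response)

print(mapping)  # {'inf': 'inf_val'}
print(formula)  # inf_val ~ rh*temp*tl*Tt*C(location)
```

With the DataFrame from the question, this would be applied as df = df.rename(columns=mapping) before calling smf.ols(formula=formula, data=df).fit().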