I'm trying to make a GLM do what an OLS does, just to get a baseline understanding of GLM. But it doesn't seem to do what I'm after. Consider this code:
import numpy as np
import statsmodels.api as sm
from scipy import stats
print("### This is dummy data that clearly is y = 0.25 * x + 5: #############")
aInput = np.arange(10)
print(aInput)
aLinear = aInput.copy() * 0.25 + 5
print(aLinear)
print("### This is OLS to show clearly what we're after: ####################")
aInputConst = sm.add_constant(aInput)
model = sm.OLS(aLinear, aInputConst)
results = model.fit()
print(results.params)
print("This is GLM which looks nothing like what I expect: ##################")
model = sm.GLM(aLinear, aInput, family=sm.families.Gaussian())
result = model.fit()
y_hat = result.predict(aInput)
print(y_hat)
print("This is GLM with the constant, but it just fails: ####################")
# vvvvvvvvvvv
model = sm.GLM(aLinear, aInputConst, family=sm.families.Gaussian())
result = model.fit()
y_hat = result.predict(aInput)
print(y_hat)
Now consider the output:
### This is dummy data that clearly is y = 0.25 * x + 5: #############
[0 1 2 3 4 5 6 7 8 9]
[5. 5.25 5.5 5.75 6. 6.25 6.5 6.75 7. 7.25]
### This is OLS to show clearly what we're after: ####################
[5. 0.25]
This is GLM which looks nothing like what I expect: ##################
[0. 1.03947368 2.07894737 3.11842105 4.15789474 5.19736842
6.23684211 7.27631579 8.31578947 9.35526316]
This is GLM with the constant, but it just fails: ####################
Traceback (most recent call last):
File "min.py", line 26, in <module>
result = model.fit()
File "/export/home/jm43436e/.local/lib/python3.6/site-packages/statsmodels/genmod/generalized_linear_model.py", line 1065, in fit
cov_kwds=cov_kwds, use_t=use_t, **kwargs)
File "/export/home/jm43436e/.local/lib/python3.6/site-packages/statsmodels/genmod/generalized_linear_model.py", line 1211, in _fit_irls
raise PerfectSeparationError(msg)
statsmodels.tools.sm_exceptions.PerfectSeparationError: Perfect separation detected, results not available
Observe: the OLS params come out exactly as expected. My question is: why does the first GLM give predictions that look nothing like the data I fed it, and why does adding the constant make the second GLM blow up with PerfectSeparationError instead of fixing things? I'm clearly confused. All help appreciated!
Got it! Two things:
Yes, you do need that sm.add_constant() call. As noted on a number of sites I read, statsmodels never adds an intercept for you: if you want an intercept value, you must put a constant column into the design matrix yourself. (sklearn's LinearRegression, by contrast, fits an intercept by default unless you pass fit_intercept=False.)
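A quick numpy-only sketch of what sm.add_constant actually does (the lstsq solve here is my own illustration, not from statsmodels): it just prepends a column of ones, and the intercept is the coefficient fitted to that column.

```python
import numpy as np

x = np.arange(10)
y = x * 0.25 + 5

# sm.add_constant(x) is equivalent to stacking a ones column in front of x:
X = np.column_stack([np.ones_like(x, dtype=float), x])

# Plain least squares on that design matrix recovers intercept and slope;
# the intercept is just the coefficient on the ones column.
params, *_ = np.linalg.lstsq(X, y, rcond=None)
print(params)  # approximately [5.0, 0.25]
```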
And the issue was that the dummy data was just too good. Adding some noise so the points jitter around the regression line makes it work fine. It seems that when the fit is exact — residuals of exactly zero — statsmodels' IRLS fitting loop treats this as perfect separation and refuses to return results, which is what the PerfectSeparationError was complaining about.
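For what it's worth, here is a seeded variant of that jitter (the seed and default_rng are my additions, for reproducibility). It shows the noise is a multiplicative factor within about ±5% of 1, which is enough to keep the residuals from being exactly zero:

```python
import numpy as np

rng = np.random.default_rng(42)           # seeded so the run is reproducible
x = np.arange(10)
noise = (rng.random(10) - 0.5) / 10 + 1   # factors in [0.95, 1.05)
y = x * 0.25 * noise + 5                  # no longer an exact straight line

# The residuals against the "true" line are small but nonzero,
# so an exact-fit check can no longer trigger.
residuals = y - (x * 0.25 + 5)
print(residuals)
```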
The working code with comments on the changes is:
import numpy as np
import statsmodels.api as sm
from scipy import stats
print("### This is dummy data that clearly is y = 0.25 * x + 5: #############")
aInput = np.arange(10)
print(aInput)
noise = (np.random.rand(10) - 0.5) / 10 + 1  # New: multiplicative factors in [0.95, 1.05).
aLinear = aInput.copy() * 0.25 * noise + 5   # Changed: jitter the slope so the fit isn't exact.
print(aLinear)
print("### This is OLS to show clearly what we're after: ####################")
aInputConst = sm.add_constant(aInput)
model = sm.OLS(aLinear, aInputConst)
results = model.fit()
print(results.params)
# print("This is GLM which looks nothing like what I expect: ##################")
# model = sm.GLM(aLinear, aInput, family=sm.families.Gaussian())
# result = model.fit()
# y_hat = result.predict(aInput)
# print(y_hat)
print("This is GLM with the constant, but it just fails: ####################")
# Changed: the model below used to raise PerfectSeparationError; with the jittered data it now fits.
model = sm.GLM(aLinear, aInputConst, family=sm.families.Gaussian())
result = model.fit()
y_hat = result.predict(aInputConst) # Changed: use the "Const" version.
print(y_hat)
And the output is:
### This is dummy data that clearly is y = 0.25 * x + 5: #############
[0 1 2 3 4 5 6 7 8 9]
[5. 5.25190469 5.48057897 5.74306435 6.00161857 6.20957869
6.43812791 6.71636422 7.09118653 7.25882849]
### This is OLS to show clearly what we're after: ####################
[4.98249325 0.25258489]
This is GLM with the constant, but it just fails: ####################
[4.98249325 5.23507814 5.48766302 5.74024791 5.9928328 6.24541768
6.49800257 6.75058746 7.00317235 7.25575723]
And as noted, the one thing I was after was seeing the coefficients that were built into the data, and sure enough, they come out right:
[4.98249325 0.25258489]
So yes, you need the constant column. And if your test data is too clean, jitter it a bit so it's more like messy real-world data.