pythonstatsmodelsglm

Trying to get a Gaussian GLM to simply match OLS but is either very wrong or has perfect separation


I'm trying to make a GLM do what an OLS does, just to get a baseline understanding of GLM. But it doesn't seem to do what I'm after. Consider this code:

import numpy as np
import statsmodels.api as sm
from scipy import stats

print("### This is dummy data that clearly is y = 0.25 * x + 5: #############")
aInput = np.arange(10)
print(aInput)
aLinear = aInput.copy() * 0.25 + 5
print(aLinear)

print("### This is OLS to show clearly what we're after: ####################")
aInputConst = sm.add_constant(aInput)
model = sm.OLS(aLinear, aInputConst)
results = model.fit()
print(results.params)

print("This is GLM which looks nothing like what I expect: ##################")
model = sm.GLM(aLinear, aInput, family=sm.families.Gaussian())
result = model.fit()
y_hat = result.predict(aInput)
print(y_hat)

print("This is GLM with the constant, but it just fails: ####################")
#                       vvvvvvvvvvv
model = sm.GLM(aLinear, aInputConst, family=sm.families.Gaussian())
result = model.fit()
y_hat = result.predict(aInput)
print(y_hat)

Now consider the output:

### This is dummy data that clearly is y = 0.25 * x + 5: #############
[0 1 2 3 4 5 6 7 8 9]
[5.   5.25 5.5  5.75 6.   6.25 6.5  6.75 7.   7.25]
### This is OLS to show clearly what we're after: ####################
[5.   0.25]
This is GLM which looks nothing like what I expect: ##################
[0.         1.03947368 2.07894737 3.11842105 4.15789474 5.19736842
 6.23684211 7.27631579 8.31578947 9.35526316]
This is GLM with the constant, but it just fails: ####################
Traceback (most recent call last):
  File "min.py", line 26, in <module>
    result = model.fit()
  File "/export/home/jm43436e/.local/lib/python3.6/site-packages/statsmodels/genmod/generalized_linear_model.py", line 1065, in fit
    cov_kwds=cov_kwds, use_t=use_t, **kwargs)
  File "/export/home/jm43436e/.local/lib/python3.6/site-packages/statsmodels/genmod/generalized_linear_model.py", line 1211, in _fit_irls
    raise PerfectSeparationError(msg)
statsmodels.tools.sm_exceptions.PerfectSeparationError: Perfect separation detected, results not available

Observe:

My question is:

I'm clearly confused. All help appreciated!


Solution

  • Got it! Two things:

    The working code with comments on the changes is:

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats
    
    print("### This is dummy data that clearly is y = 0.25 * x + 5: #############")
    aInput = np.arange(10)
    print(aInput)
    noise = (np.random.rand(10) - 0.5) / 10 + 1 # New.
    aLinear = aInput.copy() * 0.25 * noise + 5 # Changed: add noise.
    print(aLinear)
    
    print("### This is OLS to show clearly what we're after: ####################")
    aInputConst = sm.add_constant(aInput)
    model = sm.OLS(aLinear, aInputConst)
    results = model.fit()
    print(results.params)
    
    #   print("This is GLM which looks nothing like what I expect: ##################")
    #   model = sm.GLM(aLinear, aInput, family=sm.families.Gaussian())
    #   result = model.fit()
    #   y_hat = result.predict(aInput)
    #   print(y_hat)
    
    print("This is GLM with the constant, but it just fails: ####################")
    #                       vvvvvvvvvvv
    model = sm.GLM(aLinear, aInputConst, family=sm.families.Gaussian())
    result = model.fit()
    y_hat = result.predict(aInputConst) # Changed: use the "Const" version.
    print(y_hat)
    
    

    And the output is:

    ### This is dummy data that clearly is y = 0.25 * x + 5: #############
    [0 1 2 3 4 5 6 7 8 9]
    [5.         5.25190469 5.48057897 5.74306435 6.00161857 6.20957869
     6.43812791 6.71636422 7.09118653 7.25882849]
    ### This is OLS to show clearly what we're after: ####################
    [4.98249325 0.25258489]
    This is GLM with the constant, but it just fails: ####################
    [4.98249325 5.23507814 5.48766302 5.74024791 5.9928328  6.24541768
     6.49800257 6.75058746 7.00317235 7.25575723]
    
    

    And as noted, the one thing I was after is seeing the coefficients which are contrived in the data, and sure enough, they come out right:

    [4.98249325 0.25258489]
    

    So yes, you need the constant column. And if you're test data is too clean, jitter it a bit so it's more like messy real-world data.