Tags: python, numpy, statsmodels

statsmodels OLS yields different results than matrix algebra


I need to perform an OLS regression in Python using numpy and matrix algebra.

The code I used is the following:

    import numpy as np
    coeff = np.linalg.inv(X.T@X)@X.T@y

where X is the matrix of independent variables and y is the dependent variable vector. I compared these results to the estimates obtained with statsmodels OLS as follows:

    import numpy as np
    import statsmodels.api as sm
    model = sm.OLS(y, X)
    results = model.fit()
    coeff = results.params

For some reason I get different results, and I'm wondering why. I've also tried writing the matrix algebra form differently, but it always seems to yield different results from statsmodels OLS.

Here is an example that leads to different results:

    data = np.array([[1, 2, 3, 14],
                     [2, 4, 5, 25],
                     [3, 6, 7, 36],
                     [4, 8, 9, 47],
                     [5, 10, 11, 58]])

    # Independent variables (X) - First three columns
    X = data[:, :-1]

    # Dependent variable (Y) - Last column
    y = data[:, -1]

statsmodels OLS produces a vector of coefficients, while the matrix algebra form raises an error because X.T@X is not invertible. The actual dataset, for which I need the matrix algebra approach, is significantly larger, so I preferred to include this smaller example. In my actual case X.T@X is invertible, but the coefficients are considerably different from the statsmodels OLS coefficients. I assume it comes down to the inverse of X.T@X and a different estimation procedure between the two approaches, hence my question.
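
A quick way to confirm that X.T@X is singular for this small example (the second column of X is exactly twice the first):

    import numpy as np

    X = np.array([[1, 2, 3],
                  [2, 4, 5],
                  [3, 6, 7],
                  [4, 8, 9],
                  [5, 10, 11]])

    # X has a linearly dependent column, so both X and X.T @ X have rank 2
    print(np.linalg.matrix_rank(X))        # 2
    print(np.linalg.matrix_rank(X.T @ X))  # 2 < 3, hence np.linalg.inv fails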


Solution

  • As per the statsmodels.regression.linear_model.OLS.fit documentation, the default method ('pinv') solves the least squares problem using the Moore–Penrose pseudoinverse. Your numpy code instead uses the exact inverse, np.linalg.inv, which fails when X.T@X is singular, as it is in your example: the second column of X is exactly twice the first, so X.T@X has rank 2. The Moore–Penrose pseudoinverse is implemented in np.linalg.pinv. Using that, the results for your example match.

    import numpy as np
    import statsmodels.api as sm
    
    data = np.array([[1, 2, 3, 14],
                     [2, 4, 5, 25],
                     [3, 6, 7, 36],
                     [4, 8, 9, 47],
                     [5, 10, 11, 58]])
    
    # Independent variables (X) - First three columns
    X = data[:, :-1]
    
    # Dependent variable (Y) - Last column
    y = data[:, -1:]
    
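    # statsmodels OLS: fit() defaults to method='pinv', i.e. the Moore-Penrose pseudoinverse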
    model = sm.OLS(y, X)
    results = model.fit()
    coeff_sm = results.params[:,None]
    
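    # Same estimate via the normal equations, using pinv instead of inv for the singular X.T @ X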
    coeff_np = np.linalg.pinv(X.T@X)@X.T@y
    
    print(coeff_np)
    print(np.allclose(coeff_sm, coeff_np))
    

    Output:

    [[1.]
     [2.]
     [3.]]
    True
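
  • As a follow-up, if explicitly forming and (pseudo)inverting X.T@X is a concern on the larger dataset, the same minimum-norm least squares solution can be obtained from X directly, for example with np.linalg.lstsq or by applying np.linalg.pinv to X itself. A minimal sketch on the same example data:

    import numpy as np

    X = np.array([[1, 2, 3],
                  [2, 4, 5],
                  [3, 6, 7],
                  [4, 8, 9],
                  [5, 10, 11]], dtype=float)
    y = np.array([14., 25., 36., 47., 58.])

    # Minimum-norm least squares solution without forming X.T @ X
    coeff_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Equivalent: apply the pseudoinverse to X directly
    coeff_pinv = np.linalg.pinv(X) @ y

    print(coeff_lstsq)                           # [1. 2. 3.]
    print(np.allclose(coeff_lstsq, coeff_pinv))  # True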