As I understand it, numpy.linalg.lstsq and sklearn.linear_model.LinearRegression both look for solutions x of the linear system Ax = y that minimise the residual norm ||Ax - y||.
But they don't give the same result:
from sklearn import linear_model
import numpy as np
A = np.array([[1, 0], [0, 1]])
b = np.array([1, 0])
x, _, _, _ = np.linalg.lstsq(A, b)
x
Out[1]: array([ 1., 0.])
clf = linear_model.LinearRegression()
clf.fit(A, b)
coef = clf.coef_
coef
Out[2]: array([ 0.5, -0.5])
What am I overlooking?
Both of them are implemented via LAPACK's gelsd routine.
The difference is that linear_model.LinearRegression pre-processes the data by default (because it fits an intercept term), whereas np.linalg.lstsq does not. You can refer to the source code of LinearRegression for more details about this pre-processing; for the input X (your A) the relevant step is
X = (X - X_offset) / X_scale
i.e. X is centered on its column means (and scaled only if you ask for normalization) before the least-squares solve, and the intercept is recovered afterwards. If you don't want this pre-processing, set fit_intercept=False.
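As a quick check (a minimal sketch; the clf_nc, A_c and b_c names and the expected outputs in the comments are mine, not from the original post), both directions can be verified by hand:
import numpy as np
from sklearn import linear_model

A = np.array([[1, 0], [0, 1]])
b = np.array([1, 0])

# With the pre-processing disabled, LinearRegression solves on the raw data
# and its coefficients match np.linalg.lstsq.
clf_nc = linear_model.LinearRegression(fit_intercept=False)
clf_nc.fit(A, b)
clf_nc.coef_        # expected: array([ 1.,  0.])

# Conversely, centering A and b by hand and then calling lstsq reproduces
# the coefficients of the default (intercept-fitting) LinearRegression.
A_c = A - A.mean(axis=0)
b_c = b - b.mean()
np.linalg.lstsq(A_c, b_c, rcond=None)[0]        # expected: array([ 0.5, -0.5])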
Briefly speaking, if you standardize your input before the linear regression, the columns of A already have zero mean, so the centering step changes nothing (subtracting the mean of b only removes a component orthogonal to those zero-mean columns), and linear_model.LinearRegression and np.linalg.lstsq give the same coefficients, as below.
# Normalization/Scaling
from sklearn.preprocessing import StandardScaler
A = np.array([[1, 0], [0, 1]])
X_scaler = StandardScaler()
A = X_scaler.fit_transform(A)
Now A is array([[ 1., -1.],[-1., 1.]])
from sklearn import linear_model
import numpy as np
b = np.array([1, 0])
x, _, _, _ = np.linalg.lstsq(A, b)
x
Out[1]: array([ 0.25, -0.25])
clf = linear_model.LinearRegression()
clf.fit(A, b)
coef = clf.coef_
coef
Out[2]: array([ 0.25, -0.25])
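For completeness (again only a sketch under the same setup, continuing with the clf fitted just above), the intercept stored by the default fit is simply the mean of b here, because the standardized A has zero column means, and the fitted model reproduces b exactly:
clf.intercept_      # expected: 0.5, i.e. b.mean() - A.mean(axis=0) @ clf.coef_
clf.predict(A)      # expected: array([ 1.,  0.]), which is b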