As far as I know, the cost is minimized using the gradient descent algorithm by repeatedly updating the weights until convergence. In the case of linear regression we have two parameters:

m : slope
c : intercept (constant value)
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.asarray([(i + np.random.randint(1, 7)) for i in range(1, 31)]).reshape(-1, 1)
y = np.dot([3], x.T) + 5
reg = LinearRegression()
reg.fit(x,y)
I have used the sklearn library, but here we are not passing the number of iterations or the learning rate, either when initializing the model or when calling reg.fit().

Why doesn't sklearn's LinearRegression ask for iterations and a learning rate? Are they set to some default values, or does it use some other method?
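For reference, the iterative process described above would look something like this hand-rolled gradient descent. This is a minimal sketch: the learning rate and iteration count are arbitrary choices of mine, not anything sklearn uses.

```python
import numpy as np

x = np.arange(1, 31, dtype=float)
y = 3 * x + 5  # true slope m = 3, intercept c = 5

m, c = 0.0, 0.0         # initial weights
lr = 2e-3               # learning rate (chosen by hand)
for _ in range(50_000): # "repeat until convergence"
    pred = m * x + c
    # gradients of the mean squared error with respect to m and c
    grad_m = (2 / len(x)) * np.sum((pred - y) * x)
    grad_c = (2 / len(x)) * np.sum(pred - y)
    m -= lr * grad_m
    c -= lr * grad_c

print(m, c)  # approaches 3 and 5
```

Note that both hyperparameters matter: too large a learning rate diverges, too few iterations stops short of the minimum.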
Just to add a demonstration to Muhammed's answer (which you should accept, by the way), here is an example.
import numpy as np
np.random.seed(12) # Just to have a reproducible example
from sklearn.linear_model import LinearRegression
X=np.random.normal(0,1,(20,10)) # Or any array of X you want; just an MRE (minimal reproducible example)
Y=np.random.normal(0,1,(20,))
# Learning with LinearRegression, without intercept (so learning Y = X·L, L being a vector of coefficients)
reg=LinearRegression(fit_intercept=False)
reg.fit(X,Y)
print(reg.coef_) # The coefficients.
# With my random example and seed, shows
#[-0.06999151 -0.0586993 0.77203288 0.11928812 0.05656448 -0.37281412
# -0.35447307 0.06957882 0.26701851 0.06950227]
# Meaning that model is to predict that Y=-0.06999151*X₀ -0.0586993*X₁ + ...
# For example prediction for a given X
Xtest=np.random.randint(-5,5, (1,10)) # A single sample of 10 features
reg.predict(Xtest)
# returns array([4.49749641])
# Which is simply
sum(Xtest[0,i]*reg.coef_[i] for i in range(10))
# Or, using linear algebra operation
reg.coef_@Xtest[0]
# Now, the closed-form Moore-Penrose version (the normal equations)
Coef = np.linalg.inv(X.T@X)@X.T@Y
print(Coef)
#[-0.06999151 -0.0586993 0.77203288 0.11928812 0.05656448 -0.37281412
# -0.35447307 0.06957882 0.26701851 0.06950227]
# See, the same coefficients! Not "approximately the same", but exactly the same,
# including the insignificant decimal places, where you would expect some
# numerical error. This strongly suggests the very same computation is done, not merely an equivalent one
# prediction is likewise
Coef@Xtest[0]
So, no mystery here. LinearRegression is just a Moore-Penrose pseudo-inverse. Aka a least-squares solution. Aka an orthogonal projection (these are the same thing: the point P of the subspace Vec(X₁, X₂, ...) for which the distance ‖Y − P‖ is minimal is precisely the orthogonal projection of Y onto that subspace).
And even if you have no recollection of notions such as subspaces, Vec, or Moore-Penrose (I say "recollection" because, if you are doing this kind of thing, you probably had some math lessons at university or college at some point, and this is taught in every scientific curriculum in the world — but most people quickly forget it later), you can at least see that it is not an iterative process. It is just a formula: (XᵀX)⁻¹XᵀY.
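If you prefer not to spell the formula out by hand, NumPy also exposes the pseudo-inverse directly as np.linalg.pinv. A small sketch, regenerating the same random X and Y as above:

```python
import numpy as np

np.random.seed(12)
X = np.random.normal(0, 1, (20, 10))
Y = np.random.normal(0, 1, (20,))

coef_formula = np.linalg.inv(X.T @ X) @ X.T @ Y  # (XᵀX)⁻¹XᵀY by hand
coef_pinv = np.linalg.pinv(X) @ Y                # pseudo-inverse directly

print(np.allclose(coef_formula, coef_pinv))  # True
```

On this well-conditioned X both routes agree to machine precision; pinv (SVD-based) is the safer choice when XᵀX is nearly singular.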
I've simplified my example here by removing the intercept. But the intercept is just the coefficient of an extra all-ones column.
X1 = np.ones((20, 11))
X1[:, :10] = X  # first 10 columns are X, the 11th stays all ones
CoefI = np.linalg.inv(X1.T@X1)@X1.T@Y
# Returns
# array([-0.1068548 , -0.09027332, 0.73712907, 0.1136123 , 0.0904737 ,
# -0.36593051, -0.38649945, 0.02849317, 0.18063291, 0.05866195,
# -0.17597287])
regI=LinearRegression()
regI.fit(X,Y)
regI.coef_
#array([-0.1068548 , -0.09027332, 0.73712907, 0.1136123 , 0.0904737 ,
# -0.36593051, -0.38649945, 0.02849317, 0.18063291, 0.05866195])
# aka the first 10 coefficients (the ones applied to the 10 "real" columns of X)
regI.intercept_
#-0.17597287204667314
# aka the 11th coefficient of the Moore-Penrose inverse, i.e. the one
# applied to the all-ones column.
# Comparison of prediction is almost as easy
regI.predict(Xtest)
CoefI[:10]@Xtest[0]+CoefI[10]
# both return the same value, 4.633604110000001
So, even with an intercept, it is still just a linear algebra formula, not an iterative process.
Maybe sklearn is more efficient, but that is not obvious with normal-sized datasets. With small examples like my 20×10, the direct Moore-Penrose computation is 10 times faster, though that is probably just the overhead of class initialization. Even with bigger datasets like 2000×1000 (still not huge, though), Moore-Penrose is still 3 times faster. Maybe sklearn ensures better conditioning, or maybe it scales better to much bigger datasets with sparse values; I don't know. From a math perspective, it does nothing more than a Moore-Penrose inverse. From an implementation perspective, it is not easy to exhibit an example of what it does more: it is not faster, and I could not produce examples where it is more stable.
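On the conditioning point: one concrete thing an SVD-based solver (which, if I'm not mistaken, is what sklearn uses under the hood via a lstsq-style routine) avoids is forming XᵀX at all, because that squares the condition number. A quick sketch, reusing the same random X as above:

```python
import numpy as np

np.random.seed(12)
X = np.random.normal(0, 1, (20, 10))

c_X = np.linalg.cond(X)          # condition number of X itself
c_XtX = np.linalg.cond(X.T @ X)  # condition number of the normal-equations matrix
print(c_XtX / c_X**2)            # ~1.0: forming XᵀX squares the condition number
```

So with nearly collinear columns, inv(X.T @ X) can lose roughly twice as many significant digits as an SVD-based solve; on well-conditioned data like these examples, the difference is invisible.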