I've been struggling to understand why I'm getting intercept_=0.0 with LogisticRegression from scikit-learn. The fitted model has the following parameters:
LogisticRegression(C=0.0588579519026603, class_weight='balanced',
dual=False, fit_intercept=True, intercept_scaling=6.2196752179914165,
max_iter=100, multi_class='ovr', n_jobs=1, penalty='l1',
random_state=1498059397, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
The dataset I'm using has the following characteristics:
I started by exploring the coef_ attribute of the fitted model, which looks like this:
array([[-0.11210483, 0.09227395, 0.23526487, 0.1740976 , 0. ,
-0.3282085 , -0.41550312, 1.67325241, 0. , 0. ,
-0.06987265, 0. , -0.03053099, 0. , 0.09354742,
0.06188271, -0.24618392, 0.0368765 , 0. , 0. ,
-0.31796638, 1.75208672, -0.1270747 , 0.13805016, 0. ,
0.2136787 , -0.4032387 , -0.00261153, 0. , 0.17788052,
-0.0167915 , 0.34149755, 0.0233405 , -0.09623664, -0.12918872,
0. , 0.47359295, -0.16455172, -0.03106686, 0.00525001,
0.13036978, 0. , 0. , 0.01318782, -0.10392985,
0. , -0.91211158, -0.11622266, -0.18233443, 0.43319013,
-0.06818055, -0.02732619, 0. , -0.09166496, 0.03753666,
0.03857431, 0. , -0.02650828, 0.19030955, 0.70891911,
-0.07383034, -1.29428322, -0.69191842, 0. , 0.43798269,
-0.66869241, 0. , 0.44498888, -0.08931519]])
where we can see some zeros (expected due to the L1 penalty, right?) along with intercept_=0.0.
I would like to add that when I tried class_weight=None, I got intercept_ != 0.0.
What could be the reason for this intercept_=0.0? Is the intercept being regularized as well, and happens to be set to zero (like any other coefficient in coef_)? Was it mere "luck"? Is it due to my dataset?
From the docstring on the intercept_scaling parameter to LogisticRegression:
intercept_scaling : float, default 1.
Useful only when the solver ‘liblinear’ is used and self.fit_intercept is set to True. In this case, x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic_feature_weight.
Note! the synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) intercept_scaling has to be increased.
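To see the behaviour the docstring describes, here is a minimal sketch that fits the same L1-penalized liblinear model twice, changing only intercept_scaling. The toy data, the value C=0.005, and the helper name fit_intercept_for are all made up for illustration; this is not the dataset or the settings from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data (NOT the dataset from the question):
# 160 positives, 40 negatives, 5 features, only the first weakly informative.
rng = np.random.default_rng(0)
y = np.array([1] * 160 + [0] * 40)
X = rng.normal(size=(200, 5))
X[:, 0] += y


def fit_intercept_for(scaling):
    """Fit an L1-penalized liblinear model and return its fitted intercept."""
    clf = LogisticRegression(
        C=0.005,  # strong regularization, chosen only for illustration
        penalty="l1",
        solver="liblinear",
        fit_intercept=True,
        intercept_scaling=scaling,
    )
    clf.fit(X, y)
    return float(clf.intercept_[0])


# Small scaling: the synthetic feature's weight is penalized like any other.
intercept_small = fit_intercept_for(1.0)
# Large scaling: a tiny (cheap) weight already yields a sizeable intercept.
intercept_large = fit_intercept_for(100.0)

print("intercept_scaling=1   ->", intercept_small)
print("intercept_scaling=100 ->", intercept_large)
```

Under these assumptions the first fit should leave the intercept at exactly zero, because the L1 penalty on the synthetic feature's weight outweighs the gain in likelihood, while the second fit should produce a nonzero intercept. Whether the same thing happens on a real dataset depends on C, the class weights, and the data itself.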
Why is this normal practice? The intercept term is technically just the coefficient to a column vector of 1s that you append to your X/feature terms.
For example, using simple linear regression, say you have a dataset of features X with 2 features and 10 samples. If you were to use scipy.linalg.lstsq to get the coefficients including the intercept, you'd first want to use something like statsmodels.tools.tools.add_constant to append a column of 1s to your features. If you didn't append the column of 1s, you'd only get 2 coefficients. If you did append, you'd get a third "coefficient" which is just your intercept.
The easy way to tie that back is to think of the predicted values. The intercept term multiplied by a column of 1s is just itself--i.e. you're adding the intercept (times one) to the summed product of the other coefficients and features, to get your nx1 array of predicted values.
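The column-of-1s idea can be sketched as follows. To keep the example dependency-light, it uses numpy.linalg.lstsq in place of scipy.linalg.lstsq and numpy.column_stack in place of statsmodels' add_constant; the data, true_coef, and true_intercept are invented for illustration.

```python
import numpy as np

# Hypothetical data: 10 samples, 2 features, noiseless linear target.
rng = np.random.default_rng(0)
n_samples, n_features = 10, 2
X = rng.normal(size=(n_samples, n_features))
true_coef = np.array([1.5, -2.0])
true_intercept = 0.7
y = X @ true_coef + true_intercept

# Without a constant column: only 2 coefficients, no intercept.
coef_no_const, *_ = np.linalg.lstsq(X, y, rcond=None)

# With a column of 1s appended: the third "coefficient" is the intercept.
X_const = np.column_stack([X, np.ones(n_samples)])
coef_with_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

# Predictions: the intercept times 1 is added to the sum of coef * feature.
y_pred = X_const @ coef_with_const
y_pred_manual = X @ coef_with_const[:2] + coef_with_const[2]

print("without constant:", coef_no_const)
print("with constant:   ", coef_with_const)
```

Here the last entry of coef_with_const plays the role of the intercept, and y_pred matches y_pred_manual because multiplying by the column of 1s just adds that entry to every prediction.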