python scikit-learn logistic-regression

sklearn logistic regression gives biased results?


I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have sparse successes (usually p(success) < .05).

I have a matrix called "success_fail" that holds, for each setting (row of the design matrix), the number of successes and the number of failures. I fit the model as:

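import numpy as np
from sklearn import linear_model

# Stack the design twice: the first copy is labelled 1 (success), the second 0 (fail)
skdesign = np.vstack((design, design))
sklabel = np.hstack((np.ones(success_fail.shape[0]),
                     np.zeros(success_fail.shape[0])))
# Weight each stacked row by its observed success or fail count
skweight = np.hstack((success_fail['success'], success_fail['fail']))
logregN = linear_model.LogisticRegression(C=1, solver='lbfgs',
                                          fit_intercept=False)
logregN.fit(skdesign, sklabel, sample_weight=skweight)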

(sklearn version 0.18)

I noticed that with the regularized regression, the results are consistently biased: the model predicts more "successes" than are observed in the training data. When I relax the regularization, this bias goes away. The bias is unacceptable for my use case, but the more-regularized model does otherwise seem a bit better.

Below, I plot the results of the 1000 different regressions for 2 different values of C.

[Figure: results for the different regressions for 2 different values of C]

I looked at the parameter estimates for one of these regressions. In the plot below, each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model.

[Figure: parameter estimates for one regression, one point per parameter]

Why is this happening? How can I fix it? Can I make sklearn regularize the intercept less?


Solution

  • Thanks to the lovely folks at the sklearn mailing list, I found the answer. As you can see in the question, I built a design matrix that already included an intercept column and then fit the model with fit_intercept=False. As a result, the intercept was regularized like any other coefficient, which pulled it toward zero and inflated the predicted success rate. Very stupid on my part! All I needed to do was remove the intercept column from the design matrix and drop fit_intercept=False (the default is fit_intercept=True).
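To make the fix concrete, here is a minimal sketch on synthetic data. The data, the names `covariates`, `success`, `fail` and the helper `predicted_rate` are all made up for illustration and are not the original dataset. It assumes a reasonably recent scikit-learn and NumPy, and relies on the lbfgs solver leaving the intercept out of the penalty when fit_intercept=True. It compares the weighted average predicted success rate against the observed rate for both setups.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 50 settings, 3 covariates, sparse successes (~3%).
n_settings, n_trials = 50, 20
covariates = rng.normal(size=(n_settings, 3))   # design WITHOUT a ones column
success = rng.binomial(n_trials, 0.03, size=n_settings)
fail = n_trials - success

# Same stacking trick as in the question: each setting appears twice,
# labelled 1 (weighted by successes) and labelled 0 (weighted by failures).
X = np.vstack((covariates, covariates))
y = np.hstack((np.ones(n_settings), np.zeros(n_settings)))
w = np.hstack((success, fail))


def predicted_rate(model, X_fit):
    """Fit the model and return the trial-weighted mean predicted success rate."""
    model.fit(X_fit, y, sample_weight=w)
    p = model.predict_proba(X_fit[:n_settings])[:, 1]
    return np.average(p, weights=success + fail)


# Setup from the question: ones column in the design plus fit_intercept=False,
# so the intercept is penalized like every other coefficient.
X_with_ones = np.hstack((np.ones((2 * n_settings, 1)), X))
penalized = predicted_rate(
    LogisticRegression(C=1, solver="lbfgs", fit_intercept=False,
                       tol=1e-8, max_iter=1000),
    X_with_ones,
)

# Fix: no ones column, and let sklearn fit the intercept itself.
unpenalized = predicted_rate(
    LogisticRegression(C=1, solver="lbfgs", tol=1e-8, max_iter=1000),
    X,
)

observed = success.sum() / (success.sum() + fail.sum())
print("observed rate:             ", observed)
print("penalized intercept rate:  ", penalized)
print("unpenalized intercept rate:", unpenalized)
```

In this sketch the penalized-intercept fit should overpredict the success rate, while the fit with an unpenalized intercept should reproduce the observed rate almost exactly. The slope coefficients are still regularized with C=1 in both cases, so the benefit of the stronger regularization is kept.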