python, scikit-learn, logistic-regression

Improve Logistic regression with sklearn


I am doing a logistic regression using sklearn, but the fit I get is very steep, going almost straight from 0 to 1 (see image). Can anyone tell me which parameter I should work on so that the fit is smoother?

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

X = curve_data[0][0]
y = curve_data[0][1]
clf = LogisticRegression(C=1e5, fit_intercept=True)
clf.fit(X.reshape(-1,1), y)
X_test = np.linspace(0, 1000, 5000)
a_voir = clf.predict(X_test.reshape(-1,1))   # test, to delete
loss = expit(X_test * clf.coef_ + clf.intercept_).ravel()
# midpoint[i] = (logit(0.5)-clf.intercept_)/clf.coef_
plt.figure()
plt.scatter(data_1a_0V_VCASN_48_all[5][0], data_1a_0V_VCASN_48_all[5][1])
plt.plot(X_test, a_voir)
plt.show()

[image: the fitted curve jumps almost straight from 0 to 1]


Solution

  • Here is a small example for you, showing what happens when the data is not scaled:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.linear_model import LogisticRegression
    %matplotlib notebook
    import matplotlib.pyplot as plt
    
    np.random.seed(42) # for reproducibility
    X = np.random.rand(100, 1) * 1000  # generate a random vector that ranges from 0 to 1000
    X_test = np.linspace(0, 1000, 5000).reshape(-1, 1) # generate testing data
    y = (X > 500)  # generate binary classification labels
    y_int = y.astype(int).flatten() # convert to 0 and 1
    scaler_X = MinMaxScaler() # scaler
    # scaled
    X_scaled = scaler_X.fit_transform(X) # scale X
    clf_scaled = LogisticRegression(C=1e4, fit_intercept=True) # logistic with scaled
    clf_scaled.fit(X_scaled, y_int) # fit logistic with scaled
    X_test_scaled = scaler_X.transform(X_test) # scale test data
    probabilities_scaled = clf_scaled.predict_proba(X_test_scaled)[:, 1] # get probabilities of test data
    a_voir_scaled = probabilities_scaled * (np.max(y_int) - np.min(y_int)) + np.min(y_int) # map probabilities back to the label range (a no-op here, since labels are already 0 and 1)
    midpoint_scaled = -clf_scaled.intercept_ / clf_scaled.coef_[0] # midpoint where p = 0.5, i.e. where coef*x + intercept = logit(0.5) = 0
    midpoint_scaled = scaler_X.inverse_transform(midpoint_scaled.reshape(-1, 1)) # map the midpoint back to the original (unscaled) x-range
    
    # unscaled
    clf_unscaled = LogisticRegression(C=1e4, fit_intercept=True) # logistic with unscaled
    clf_unscaled.fit(X, y_int) # fit logistic with unscaled data
    probabilities_unscaled = clf_unscaled.predict_proba(X_test)[:, 1] # get probabilities of test data, unscaled
    a_voir_unscaled = probabilities_unscaled * (np.max(y_int) - np.min(y_int)) + np.min(y_int) # map probabilities back to the label range (a no-op here, since labels are already 0 and 1)
    midpoint_unscaled = -clf_unscaled.intercept_ / clf_unscaled.coef_[0] # midpoint where p = 0.5, already in the original x units
    
    
    # Plot the original data and the logistic regression curve
    plt.scatter(X, y_int, label='Original')
    plt.plot(X_test, a_voir_scaled, label='scaled')
    plt.plot(X_test, a_voir_unscaled, label='unscaled')
    plt.axvline(midpoint_scaled, linestyle = "--", label = "midpoint from scaled")
    plt.axvline(midpoint_unscaled, linestyle  = "-.", label = "midpoint from unscaled")
    plt.grid()
    
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend(loc = "upper left", ncols = 1)
    plt.show()
    

    It is generally best to scale your data unless you have a specific reason to preserve the original variance of your variables. You can see from how the orange curve overfits that going without any scaling can hurt the predictions. Think of scaling as listening for a pin drop in a silent room rather than at a heavy metal concert: without it, the fit is dominated by the large magnitudes in the data and misses the finer details. Play around with C in both examples and see the difference it makes; you will find that going as low as 100 is also fine for the scaled case (a small sketch for comparing several values of C follows the results below).

    The results:

    [figure: original data with the scaled and unscaled logistic curves and their midpoints]
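
    If you want to experiment with C as suggested above, here is a minimal sketch (not part of the original answer) that reuses the X, X_scaled, y_int, X_test and X_test_scaled variables defined above and overlays the probability curves for a few values of C on the scaled data:

    for C in [0.1, 1, 100, 1e4]:  # a few regularization strengths to compare
        clf_c = LogisticRegression(C=C, fit_intercept=True)
        clf_c.fit(X_scaled, y_int)  # fit on the scaled data
        plt.plot(X_test, clf_c.predict_proba(X_test_scaled)[:, 1], label=f"C={C}")  # probability curve
    plt.scatter(X, y_int, s=10, label="data")
    plt.xlabel('X')
    plt.ylabel('P(y=1)')
    plt.legend(loc="upper left")
    plt.show()

    Lower C means stronger regularization and a smoother, less steep curve; as stated above, for the scaled inputs the curves for C=100 and C=1e4 should look very similar.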