python, machine-learning, scikit-learn, logistic-regression

Logistic Regression with just ONE numeric feature


What is the right way to use scikit-learn's LogisticRegression solver when you have just one numeric feature?

I ran a simple example and got a result I find hard to explain. Can anyone please explain what I am doing wrong here?

import numpy as np
from sklearn.linear_model import LogisticRegression

X = [1, 2, 3, 10, 11, 12]
X = np.reshape(X, (6, 1))  # shape (6, 1): six samples, one feature
Y = [0, 0, 0, 1, 1, 1]     # keep the target 1-D

lr = LogisticRegression()

lr.fit(X, Y)
print("2 --> {0}".format(lr.predict([[2]])))  # predict expects a 2-D array
print("4 --> {0}".format(lr.predict([[4]])))

This is the output I get when the script finishes running. Shouldn't the prediction for 4 be 0, since, judging by the distribution of the training data, 4 is much nearer to the values that are classified as 0?

2 --> [0]
4 --> [1]

What approach does logistic regression take when you have just one numeric feature column?


Solution

  • You're handling the single feature correctly, but you're incorrectly assuming that, just because 4 is close to the 0-class feature values, it will also be predicted as class 0.

    You can plot your training data along with the sigmoid function, assuming a classification threshold of y=0.5 and using the learned coefficient and intercept from your fitted model:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression

    X = [1, 2, 3, 10, 11, 12]
    X = np.reshape(X, (6, 1))
    Y = np.array([0, 0, 0, 1, 1, 1])  # 1-D target; a column vector triggers a DataConversionWarning

    lr = LogisticRegression()
    lr.fit(X, Y)

    plt.figure(1, figsize=(4, 3))
    plt.scatter(X.ravel(), Y, color='black', zorder=20)

    def model(x):
        # The logistic sigmoid, mapping the linear term to a probability
        return 1 / (1 + np.exp(-x))

    X_test = np.linspace(-5, 15, 300)
    # Predicted probability of class 1 at each point along the x axis
    probs = model(X_test * lr.coef_ + lr.intercept_).ravel()

    plt.plot(X_test, probs, color='red', linewidth=3)
    plt.axhline(y=0, color='k', linestyle='-')
    plt.axhline(y=1, color='k', linestyle='-')
    plt.axhline(y=0.5, color='b', linestyle='--')
    plt.axvline(x=X_test[123], color='b', linestyle='--')  # just past the 0.5 crossing

    plt.ylabel('y')
    plt.xlabel('X')
    plt.xlim(0, 13)
    plt.show()
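
    As a sanity check, the hand-rolled sigmoid above should match scikit-learn's own probability estimates. A quick way to verify this, reusing probs and X_test from the script above (predict_proba returns one [P(class 0), P(class 1)] row per sample):

    # Column 1 of predict_proba should reproduce the plotted curve
    probs_sklearn = lr.predict_proba(X_test.reshape(-1, 1))[:, 1]
    print(np.allclose(probs, probs_sklearn))  # expected: True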
    

    Here is what the sigmoid function looks like in your case:

    [Plot: the fitted sigmoid curve over the six training points]

    Zoomed in a bit:

    [Plot: the same curve, zoomed in around the 0.5 crossing]

    For your particular model, the value of X at which the sigmoid crosses the 0.5 classification threshold is somewhere between 3.161 and 3.227. You can check this by comparing the probs and X_test arrays (X_test[123] is the X value associated with the upper bound); if you want an exact value, you can use a root-finding routine.
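
    In fact, for logistic regression you don't need a numerical search at all: the sigmoid equals 0.5 exactly where its argument is zero, i.e. where coef * x + intercept = 0. A minimal closed-form sketch, reusing the fitted lr from above:

    # Solve coef * x + intercept = 0 for x
    boundary = -lr.intercept_[0] / lr.coef_[0][0]
    print("Decision boundary at X = {0}".format(boundary))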

    So 4 is predicted as class 1 because it lies above the X value at which the sigmoid crosses the Y == 0.5 threshold.

    You can further show this with the following:

    print("2 --> {0}".format(lr.predict([[2]])))
    print("3 --> {0}".format(lr.predict([[3]])))
    print("3.1 --> {0}".format(lr.predict([[3.1]])))
    print("3.3 --> {0}".format(lr.predict([[3.3]])))
    print("4 --> {0}".format(lr.predict([[4]])))
    

    Which will print out the following:

    2 --> [0]
    3 --> [0]
    3.1 --> [0]  # Below threshold
    3.3 --> [1]  # Above threshold
    4 --> [1]
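
    If you want to see how close each of these points sits to the boundary, rather than just the hard labels, you can also print the class probabilities that predict thresholds at 0.5 (predict_proba is part of the standard LogisticRegression API):

    # Probabilities for [class 0, class 1] at each test point
    for x in [2, 3, 3.1, 3.3, 4]:
        print("{0} --> {1}".format(x, lr.predict_proba([[x]])[0]))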