python machine-learning scikit-learn logistic-regression

Predict() with sklearn LogisticRegression and RandomForest models always predict minority class (1)


I'm building a Logistic Regression model to predict whether a transaction is valid (1) or not (0) from a dataset of about 150 observations. The data is imbalanced between the two classes: 98 observations are 0 and only 49 are 1.

I am using two numerical predictors. Despite the data being mostly 0's, the classifier predicts 1 for every transaction in my test set, even though most of them should be 0; it never outputs a 0 for any observation.

Here is my entire code:

# Logistic Regression
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import scipy
from scipy.stats import spearmanr
from pylab import rcParams
import seaborn as sb
import matplotlib.pyplot as plt
import sklearn
from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing

address = "dummy_csv-150.csv"
trades = pd.read_csv(address)
trades.columns=['location','app','el','rp','rule1','rule2','rule3','validity','transactions']
trades.head()

trade_data = trades.iloc[:, [1, 8]].values  # 'app' and 'transactions' columns
trade_data_names = ['app','transactions']

# set dependent/response variable
y = trades.iloc[:, 7].values

# standardize: center on the mean and scale to unit variance
X= scale(trade_data)

LogReg = LogisticRegression()

LogReg.fit(X,y)
print(LogReg.score(X,y))

y_pred = LogReg.predict(X)

from sklearn.metrics import classification_report

print(classification_report(y,y_pred)) 

test_data = [[2, 14], [3, 1], [1, 503], [1, 122], [1, 101], [1, 610],
             [1, 2120], [3, 85], [3, 91], [2, 167], [2, 553], [2, 144]]

log_prediction = LogReg.predict_log_proba(test_data)
prediction = LogReg.predict(test_data)

My model is defined as:

LogReg = LogisticRegression()  
LogReg.fit(X,y)

where X looks like this:

X = array([[1, 345],
       [1, 222],
       [1, 500],
       [2, 120]]....)

and Y is just 0 or 1 for each observation.

The normalized X that gets passed to the model is:

[[-1.67177659  0.14396503]
 [-1.67177659 -0.14538932]
 [-1.67177659  0.50859856]
 [-1.67177659 -0.3853417 ]
 [-1.67177659 -0.43239119]
 [-1.67177659  0.743846  ]
 [-1.67177659  4.32195953]
 [ 0.95657805 -0.46062089]
 [ 0.95657805 -0.45591594]
 [ 0.95657805 -0.37828428]
 [ 0.95657805 -0.52884264]
 [ 0.95657805 -0.20420118]
 [ 0.95657805 -0.63705646]
 [ 0.95657805 -0.65587626]
 [ 0.95657805 -0.66763863]
 [-0.35759927 -0.25125067]
 [-0.35759927  0.60975496]
 [-0.35759927 -0.33358727]
 [-0.35759927 -0.20420118]
 [-0.35759927  1.37195666]
 [-0.35759927  0.27805607]
 [-0.35759927  0.09456307]
 [-0.35759927  0.03810368]
 [-0.35759927 -0.41121892]
 [-0.35759927 -0.64411389]
 [-0.35759927 -0.69586832]
 [ 0.95657805 -0.57353966]
 [ 0.95657805 -0.57353966]
 [ 0.95657805 -0.53825254]
 [ 0.95657805 -0.53354759]
 [ 0.95657805 -0.52413769]
 [ 0.95657805 -0.57589213]
 [ 0.95657805  0.03810368]
 [ 0.95657805 -0.66293368]
 [ 0.95657805  2.86107294]
 [-1.67177659  0.14396503]
 [-1.67177659 -0.14538932]
 [-1.67177659  0.50859856]
 [-1.67177659 -0.3853417 ]
 [-1.67177659 -0.43239119]
 [-1.67177659  0.743846  ]
 [-1.67177659  4.32195953]
 [ 0.95657805 -0.46062089]
 [ 0.95657805 -0.45591594]
 [ 0.95657805 -0.37828428]
 [ 0.95657805 -0.52884264]
 [ 0.95657805 -0.20420118]
 [ 0.95657805 -0.63705646]
 [ 0.95657805 -0.65587626]
 [ 0.95657805 -0.66763863]
 [-0.35759927 -0.25125067]
 [-0.35759927  0.60975496]
 [-0.35759927 -0.33358727]
 [-0.35759927 -0.20420118]
 [-0.35759927  1.37195666]
 [-0.35759927  0.27805607]
 [-0.35759927  0.09456307]
 [-0.35759927  0.03810368]
 [-0.35759927 -0.41121892]
 [-0.35759927 -0.64411389]
 [-0.35759927 -0.69586832]
 [ 0.95657805 -0.57353966]
 [ 0.95657805 -0.57353966]
 [ 0.95657805 -0.53825254]
 [ 0.95657805 -0.53354759]
 [ 0.95657805 -0.52413769]
 [ 0.95657805 -0.57589213]
 [ 0.95657805  0.03810368]
 [ 0.95657805 -0.66293368]
 [ 0.95657805  2.86107294]
 [-1.67177659  0.14396503]
 [-1.67177659 -0.14538932]
 [-1.67177659  0.50859856]
 [-1.67177659 -0.3853417 ]
 [-1.67177659 -0.43239119]
 [-1.67177659  0.743846  ]
 [-1.67177659  4.32195953]
 [ 0.95657805 -0.46062089]
 [ 0.95657805 -0.45591594]
 [ 0.95657805 -0.37828428]
 [ 0.95657805 -0.52884264]
 [ 0.95657805 -0.20420118]
 [ 0.95657805 -0.63705646]
 [ 0.95657805 -0.65587626]
 [ 0.95657805 -0.66763863]
 [-0.35759927 -0.25125067]
 [-0.35759927  0.60975496]
 [-0.35759927 -0.33358727]
 [-0.35759927 -0.20420118]
 [-0.35759927  1.37195666]
 [-0.35759927  0.27805607]
 [-0.35759927  0.09456307]
 [-0.35759927  0.03810368]
 [-0.35759927 -0.41121892]
 [-0.35759927 -0.64411389]
 [-0.35759927 -0.69586832]
 [ 0.95657805 -0.57353966]
 [ 0.95657805 -0.57353966]
 [ 0.95657805 -0.53825254]
 [ 0.95657805 -0.53354759]
 [ 0.95657805 -0.52413769]
 [ 0.95657805 -0.57589213]
 [ 0.95657805  0.03810368]
 [ 0.95657805 -0.66293368]
 [ 0.95657805  2.86107294]
 [-1.67177659  0.14396503]
 [-1.67177659 -0.14538932]
 [-1.67177659  0.50859856]
 [-1.67177659 -0.3853417 ]
 [-1.67177659 -0.43239119]
 [-1.67177659  0.743846  ]
 [-1.67177659  4.32195953]
 [ 0.95657805 -0.46062089]
 [ 0.95657805 -0.45591594]
 [ 0.95657805 -0.37828428]
 [ 0.95657805 -0.52884264]
 [ 0.95657805 -0.20420118]
 [ 0.95657805 -0.63705646]
 [ 0.95657805 -0.65587626]
 [ 0.95657805 -0.66763863]
 [-0.35759927 -0.25125067]
 [-0.35759927  0.60975496]
 [-0.35759927 -0.33358727]
 [-0.35759927 -0.20420118]
 [-0.35759927  1.37195666]
 [-0.35759927  0.27805607]
 [-0.35759927  0.09456307]
 [-0.35759927  0.03810368]
 [-0.35759927 -0.41121892]
 [-0.35759927 -0.64411389]
 [-0.35759927 -0.69586832]
 [ 0.95657805 -0.57353966]
 [ 0.95657805 -0.57353966]
 [ 0.95657805 -0.53825254]
 [ 0.95657805 -0.53354759]
 [ 0.95657805 -0.52413769]
 [ 0.95657805 -0.57589213]
 [ 0.95657805  0.03810368]
 [ 0.95657805 -0.66293368]
 [ 0.95657805  2.86107294]
 [-0.35759927  0.60975496]
 [-0.35759927 -0.33358727]
 [-0.35759927 -0.20420118]
 [-0.35759927  1.37195666]
 [-0.35759927  0.27805607]
 [-0.35759927  0.09456307]
 [-0.35759927  0.03810368]]

and Y is:

[0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0
 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0
 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0]
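
For reference, counting the classes confirms the imbalance toward 0 (this matches the support column in the report below):

import numpy as np

# 98 zeros vs. 49 ones in the training labels
print(np.bincount(y))  # -> [98 49]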

The model metrics are:

             precision    recall  f1-score   support

          0       0.78      1.00      0.88        98
          1       1.00      0.43      0.60        49

avg / total       0.85      0.81      0.78       147

with LogReg.score(X, y) returning 0.80.

When I run LogReg.predict_log_proba(test_data), I get log-probabilities that look like this:

array([[ -1.10164032e+01,  -1.64301095e-05],
       [ -2.06326947e+00,  -1.35863187e-01],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00]])
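
Exponentiating these log-probabilities shows the model assigns essentially all of the probability mass to class 1 (exp(-inf) is 0.0):

import numpy as np

# predict_log_proba returns log-probabilities; exp() recovers the probabilities.
# Column 0 is P(class 0), column 1 is P(class 1).
print(np.exp(log_prediction))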

My test set is below; all but 2 of these observations should be 0, yet every one of them is classified as 1. This happens for every test set, even ones containing values the model was trained on.

[2, 14], [3, 1], [1, 503], [1, 122], [1, 101], [1, 610], [1, 2120], [3, 85], [3, 91], [2, 167], [2, 553], [2, 144]

I found a similar question here: https://stats.stackexchange.com/questions/168929/logistic-regression-is-predicting-all-1-and-no-0, but in that question the problem was that the data was mostly 1's, so it made sense for the model to output 1's. My case is the opposite: the training data is mostly 0's, yet the model outputs 1 for everything. I also tried a RandomForestClassifier to rule out the model itself, but the same thing happened.

What could be wrong? The data meets the assumptions of the logistic model (the predictors are independent, the output is binary, and there are no missing data points).


Solution

  • You are not scaling your test data. You are correct to scale your training data, which you do here:

    X= scale(trade_data)
    

    After you train your model, you do not do the same with the test data; test_data still holds the raw, unscaled values:

    log_prediction = LogReg.predict_log_proba(test_data)
    

    The coefficients of your model were learned from normalized inputs, but your test data is not normalized. Any positive coefficient gets multiplied by a huge raw value (for example, transactions = 2120), which drives the predicted probability for class 1 toward 1 for every row.
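
    You can see the effect directly. Here is a minimal sketch; the coefficients below are made up for illustration (inspect LogReg.coef_ and LogReg.intercept_ for the values from your actual fit):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # hypothetical coefficients, on the scale you would get from standardized inputs
    w = np.array([0.5, 1.2])
    b = -1.0

    scaled_row = np.array([0.96, -0.20])  # a typical standardized row
    raw_row = np.array([1, 2120])         # the same kind of row, unscaled

    print(sigmoid(w @ scaled_row + b))  # ~0.32, a moderate probability
    print(sigmoid(w @ raw_row + b))     # ~1.0, the huge raw value saturates the sigmoid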

    A general rule is that whatever transformations you apply to your training set should likewise be applied to your test set, using the statistics learned from the training set. Instead of:

    X = scale(trade_data)
    

    You should create a scaler from your training data like this:

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().fit(trade_data)
    X = scaler.transform(trade_data)
    

    Then later apply that scaler to your test data:

    scaled_test = scaler.transform(test_data)
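
    Putting it together: here is a minimal sketch using a scikit-learn Pipeline, which couples the scaler and the classifier so the same transformation is applied automatically at both fit and predict time (variable names as in your code above):

    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # the scaler's mean/std are learned from the training data during fit()
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(trade_data, y)

    # test_data can be passed in raw; the pipeline standardizes it
    # with the statistics learned from the training data
    prediction = model.predict(test_data)
    log_prediction = model.predict_log_proba(test_data)

    This removes the chance of forgetting to transform new data, since the raw features always go through the same fitted scaler.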