python scikit-learn logistic-regression prediction roc

Logistic Regression Model - Prediction and ROC AUC


I am building a logistic regression model with statsmodels (statsmodels.api) and would like to understand how to get predictions for the test dataset. This is what I have so far:

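import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# 70/30 train/test split
x_train_data, x_test_data, y_train_data, y_test_data = train_test_split(X, df[target_var], test_size=0.3)

logit = sm.Logit(
     y_train_data, 
     x_train_data
)

result = logit.fit()
result.summary()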

What is the best way to print the predictions for y_train_data and y_test_data from the calls below? I am also unsure which regression metrics to use or import in this case:

in_sample_pred = result.predict(x_train_data)
out_sample_pred = result.predict(x_test_data)

Also, what's the best way to calculate the ROC AUC score and plot the ROC curve for this logistic regression model (using the scikit-learn package)?

Thanks


Solution

  • The confusion may come from the fact that statsmodels' Logit is a logistic regression model used for classification, and its predict method already returns probabilities, which is exactly what sklearn's roc_auc_score takes as scores.

    To predict based on your x_test_data, all you have to do is:

    x_test_predicted = result.predict(x_test_data)
    
    print(x_test_predicted)
    

    To get a better feel for the predictions, you can put them next to the true labels in a DataFrame:

    import pandas as pd 
    
    df_test_predictions = pd.DataFrame({
        'x_test_predicted': x_test_predicted, 
        'y_test': y_test_data 
        })
    

    Then to calculate ROC-AUC, you can do:

    from sklearn.metrics import roc_auc_score
    
    score = roc_auc_score(y_test_data, x_test_predicted)
    print(score)
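
As a sanity check on what that number means: ROC AUC equals the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative one. The sketch below computes that directly in plain Python on made-up data. The names y_true, y_score and pairwise_auc are illustrative only and are not part of statsmodels or scikit-learn; on the same inputs roc_auc_score should return the same value.

```python
# Hypothetical toy data (not from the question): true labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]


def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


auc = pairwise_auc(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

The double loop is quadratic in the number of rows, so this is only meant for building intuition on small samples, not as a replacement for roc_auc_score.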
    

    Finally, for the plot, refer to this previously answered question. The bare bones of it are:

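    import sklearn.metrics as metrics
    import matplotlib.pyplot as plt 
    
    # Use the same test labels and predicted probabilities as above
    fpr, tpr, threshold = metrics.roc_curve(y_test_data, x_test_predicted)
    
    plt.plot(fpr, tpr)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.show()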