pythonlogistic-regressionroc

Roc curve and cut off point. Python


I ran a logistic regression model and made predictions of the logit values. I used this to get the points on the ROC curve:

 from sklearn import metrics
 fpr, tpr, thresholds = metrics.roc_curve(Y_test,p)

I know metrics.roc_auc_score gives the area under the ROC curve. Can anyone tell me what command will find the optimal cut-off point (threshold value)?


Solution

  • You can do this using the epi package in R, however I could not find similar package or example in Python.

    The optimal cut off point would be where “true positive rate” is high and the “false positive rate” is low. Based on this logic, I have pulled an example below to find optimal threshold.

    Python code:

    import pandas as pd
    import statsmodels.api as sm
    import pylab as pl
    import numpy as np
    from sklearn.metrics import roc_curve, auc
    
    # read the data in
    df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
    
    # rename the 'rank' column because there is also a DataFrame method called 'rank'
    df.columns = ["admit", "gre", "gpa", "prestige"]
    # dummify rank
    dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige')
    # create a clean data frame for the regression
    cols_to_keep = ['admit', 'gre', 'gpa']
    data = df[cols_to_keep].join(dummy_ranks.iloc[:, 'prestige_2':])
    
    # manually add the intercept
    data['intercept'] = 1.0
    
    train_cols = data.columns[1:]
    # fit the model
    result = sm.Logit(data['admit'], data[train_cols]).fit()
    print result.summary()
    
    # Add prediction to dataframe
    data['pred'] = result.predict(data[train_cols])
    
    fpr, tpr, thresholds =roc_curve(data['admit'], data['pred'])
    roc_auc = auc(fpr, tpr)
    print("Area under the ROC curve : %f" % roc_auc)
    
    ####################################
    # The optimal cut off would be where tpr is high and fpr is low
    # tpr - (1-fpr) is zero or near to zero is the optimal cut off point
    ####################################
    i = np.arange(len(tpr)) # index for df
    roc = pd.DataFrame({'fpr' : pd.Series(fpr, index=i),'tpr' : pd.Series(tpr, index = i), '1-fpr' : pd.Series(1-fpr, index = i), 'tf' : pd.Series(tpr - (1-fpr), index = i), 'thresholds' : pd.Series(thresholds, index = i)})
    roc.iloc[(roc.tf-0).abs().argsort()[:1]]
    
    # Plot tpr vs 1-fpr
    fig, ax = pl.subplots()
    pl.plot(roc['tpr'])
    pl.plot(roc['1-fpr'], color = 'red')
    pl.xlabel('1-False Positive Rate')
    pl.ylabel('True Positive Rate')
    pl.title('Receiver operating characteristic')
    ax.set_xticklabels([])
    

    The optimal cut off point is 0.317628, so anything above this can be labeled as 1 else 0. You can see from the output/chart that where TPR is crossing 1-FPR the TPR is 63%, FPR is 36% and TPR-(1-FPR) is nearest to zero in the current example.

    Output:

            1-fpr       fpr        tf     thresholds       tpr
      171  0.637363  0.362637  0.000433    0.317628     0.637795
    

    enter image description here

    Hope this is helpful.

    Edit

    To simplify and bring in re-usability, I have made a function to find the optimal probability cutoff point.

    Python Code:

    def Find_Optimal_Cutoff(target, predicted):
        """ Find the optimal probability cutoff point for a classification model related to event rate
        Parameters
        ----------
        target : Matrix with dependent or target data, where rows are observations
    
        predicted : Matrix with predicted data, where rows are observations
    
        Returns
        -------     
        list type, with optimal cutoff value
            
        """
        fpr, tpr, threshold = roc_curve(target, predicted)
        i = np.arange(len(tpr)) 
        roc = pd.DataFrame({'tf' : pd.Series(tpr-(1-fpr), index=i), 'threshold' : pd.Series(threshold, index=i)})
        roc_t = roc.iloc[(roc.tf-0).abs().argsort()[:1]]
    
        return list(roc_t['threshold']) 
    
    
    # Add prediction probability to dataframe
    data['pred_proba'] = result.predict(data[train_cols])
    
    # Find optimal probability threshold
    threshold = Find_Optimal_Cutoff(data['admit'], data['pred_proba'])
    print threshold
    # [0.31762762459360921]
    
    # Find prediction to the dataframe applying threshold
    data['pred'] = data['pred_proba'].map(lambda x: 1 if x > threshold else 0)
    
    # Print confusion Matrix
    from sklearn.metrics import confusion_matrix
    confusion_matrix(data['admit'], data['pred'])
    # array([[175,  98],
    #        [ 46,  81]])