Tags: python, numpy, machine-learning, statistics, statsmodels

Statistical learning confusion table variable


I am getting an extra label in my confusion table and I'm not sure where it's coming from. The dataset 'Default' has the following columns: default, student, income, balance. The variable 'default' takes two values: 'Yes' and 'No'.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)
from ISLP import confusion_table



Default = load_data('Default')
vars = Default.columns.drop(['default'])
y = Default['default'] == 'Yes'
design = MS(vars)
X = design.fit_transform(Default)
glm = sm.GLM(y,
             X,
             family = sm.families.Binomial())
results = glm.fit()
summarize(results)
probs = results.predict()
labels = np.array(['No']*10000)
labels[probs>0.5] = 'Yes'
confusion_table(labels,Default.default)

In the output, I get a 3x3 table with the labels 'No', 'Yes' and 'Ye'.

I want the confusion table to contain only 'Yes' and 'No'. Somehow, some entries of the NumPy array 'labels' end up as 'Ye' instead of 'Yes'.


Solution

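  • NumPy infers the dtype of labels = np.array(['No']*10000) as '<U2', a fixed-width string of two characters, because every element has two characters. Assigning 'Yes' into that array silently truncates it to 'Ye', which is where the extra label comes from.

    Create the array with room for three characters instead: labels = np.array(['No']*10000, dtype='<U3')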
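
The truncation can be reproduced without the dataset. The sketch below is only an illustration: the probs values are made up rather than taken from the fitted model, and the names truncated, fixed and where_labels are hypothetical. It shows the silent truncation, the dtype='<U3' fix, and np.where as one possible alternative that avoids pre-allocating a string array.

```python
import numpy as np

# Made-up probabilities standing in for results.predict() (assumption, not real output)
probs = np.array([0.1, 0.7, 0.4, 0.9])

# The dtype is inferred as a 2-character string, so 'Yes' is cut down to 'Ye'
truncated = np.array(['No'] * len(probs))
truncated[probs > 0.5] = 'Yes'

# Fix from the answer: reserve three characters per element up front
fixed = np.array(['No'] * len(probs), dtype='<U3')
fixed[probs > 0.5] = 'Yes'

# Alternative: build the labels in one step; the dtype is sized to fit 'Yes'
where_labels = np.where(probs > 0.5, 'Yes', 'No')

print(truncated.tolist())
print(fixed.tolist())
print(where_labels.tolist())
```

With either corrected array passed as the first argument to confusion_table, the predicted labels should contain only 'Yes' and 'No'.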