[SOLVED] Statistical learning confusion table variable

I am getting an extra label in my confusion table and I'm not sure where it's coming from. The dataset 'Default' has the following columns: default, student, income, balance. The variable 'default' has two values: 'Yes' and 'No'.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize,
                         poly)
from ISLP import confusion_table



Default = load_data('Default')
vars = Default.columns.drop(['default'])
y = Default['default'] == 'Yes'
design = MS(vars)
X = design.fit_transform(Default)
glm = sm.GLM(y,
             X,
             family = sm.families.Binomial())
results = glm.fit()
summarize(results)
probs = results.predict()
labels = np.array(['No']*10000)
labels[probs>0.5] = 'Yes'
confusion_table(labels,Default.default)

In the output, I get a 3x3 table with the labels 'No', 'Yes' and 'Ye'.

I want the confusion table to contain only 'Yes' and 'No'. Somehow, entries of the NumPy array 'labels' end up as 'Ye' instead of 'Yes'.

Solution

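NumPy infers a fixed-width string dtype of 2 characters ('<U2') for labels = np.array(['No']*10000), since every element of the array has two characters. Assigning 'Yes' into that array then silently truncates it to 'Ye', which is where the extra label comes from.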
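
To see the truncation in isolation, here is a minimal sketch that does not need the ISLP dataset. The five-element array and the chosen indices are made up for illustration; only the dtype behaviour matters.

```python
import numpy as np

# Every element is 'No', so NumPy picks a 2-character Unicode dtype.
labels = np.array(['No'] * 5)
print(labels.dtype)

# Assigning a longer string does not raise an error; the value is
# silently cut down to fit the fixed-width dtype.
labels[[1, 3]] = 'Yes'
print(labels)

# Inspect which distinct labels the array now holds.
print(np.unique(labels))
```

Printing the dtype before assigning is a quick way to check whether an array of strings has room for the values you plan to write into it.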

Try labels = np.array(['No']*10000, dtype='<U3') so that each element has room for three characters.
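
A small sketch of the fix, again without the dataset: the probs array below is a made-up stand-in for results.predict(). It also shows np.where as an alternative I would consider, since it builds the labels in one step and sizes the dtype to fit both strings; that alternative is my suggestion, not something from the question.

```python
import numpy as np

# Stand-in for results.predict(); these values are invented for illustration.
probs = np.array([0.1, 0.7, 0.4, 0.9])

# Option 1: reserve three characters per element up front.
labels = np.array(['No'] * len(probs), dtype='<U3')
labels[probs > 0.5] = 'Yes'
print(labels)

# Option 2: build the labels directly from the condition.
# The result dtype is wide enough for both 'Yes' and 'No'.
labels_where = np.where(probs > 0.5, 'Yes', 'No')
print(labels_where)
```

Using len(probs) instead of a hard-coded 10000 also keeps the label array in step with the data if the number of rows changes.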