python scipy binning scipy-optimize-minimize discretization

In python, how to discretize continuous variable using accuracy as a criterion taking class into consideration

For a set of subjects I have a continuous variable with range 0-100 representing a quantification of a subject's state cont_attribute. For each subject I also have an ordinal variable representing reader annotation of subject's state as one of four states (e.g. 1, 2, 3, 4) class_label. Values for cont_attribute overlap between classes. My goal is to discretize cont_attribute so that agreement with class is optimized.

When discretizing cont_attribute, arbitrary thresholds x1, x2, x3 can be applied to the continuous variable directly, to yield bins of four ordinal categories and agreement with reader annotation class can be assessed:

cohen_kappa_score((pd.cut(df['cont_attribute'],bins=[0, x1, x2, x3, 100], labels=['1','2','3','4']).astype('int'))
, df['class_label'].astype('int'))

I have found several options for discretizatin of continuous variable such as Jenks natural breaks and sklearn Kmeans, though these options do not take into account class.

What I tried:

I attempted to optimize the function above to yield the maximal value using scipy.optimize.minimize. Here for each threshold between two classes, I use the minimum value of the larger class and the maximal value of the smaller classes as a range with which to find the respective optimal cutoff point between those classes. With this approach I run into a problem, prompting:

ValueError: bins must increase monotonically.

def objfunc(grid):
    x1, x2, x3 = grid
    return (cohen_kappa_score((pd.cut(df.cont_attribute,bins=[0, x1, x2, x3, 100],labels=['1','2','3','4'], duplicates='drop').astype('int'))
, df['class_label'].astype('int'))) * (-1);

grid = (slice(df[(df['class_label'] == 2)]['cont_attribute'].min(), df[(df['class_label'] == 1)]['cont_attribute'].max(), 0.5), (slice(df[(df['class_label'] == 3)]['cont_attribute'].min(), df[(df['class_label'] == 2)]['cont_attribute'].max(), 0.5), (slice(df[(df['class_label'] == 4)]['cont_attribute'].min(), df[(df['class_label'] == 3)]['cont_attribute'].max(), 0.5))
solution = brute(objfunc, grid, finish=None,full_output = True)
solution

In python, is there a straightforward way to optimize thresholds x1, x2, x3 taking agreement with class into account (supervised discretization)? Alternatively, how can the above function be rewritten to yield a maximum using scipy.optimize.minimize?

Solution

The error message is not too hard. The pandas cut method demands that the cut vector [0,x1,x2,x3,100] is strictly monotinic. By having some mechanism to make sure that no invalid values are passed to the cut function, we are safe. That is what I implemented below. To denote an invalid setting, it is customary to use np.inf since all other values are lower. Therefore, every minizmier would say such an invalid is undesirable as a solution. See below for the implementation. I also included all the imports and some data generation, so that it is simple to use the code. Please do so in future questions as well.

You might want to use more than 10 bins per dimension in the brute force search.

Also - the code is quite inefficient. Since it brute forces over all combinations of x1, x2, x3, but a lot of them are invalid (e.g. x2<=x1), you might want to parametrize the problem in (x1,x2-x1, x3-x2) instead, and search over nonnegative values in the second and third component.

Finally, the brute method is a minimizer, so you should return -cohen_kappa from the objective

#%%
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from scipy.stats import truncnorm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.optimize import brute

#
# Generate Data
#
n = 1000
np.random.seed(0)
y = np.random.choice(4, p=[0.1, 0.3, 0.4, 0.2], size=n)
x = np.zeros(n)
for i in range(5):
    low = 0
    high = 100
    mymean = 20 * i
    myscale = 8
    a, b = (low - mymean) / myscale, (high - mymean) / myscale
    x[y == i] = truncnorm.rvs(a=a, b=b, loc=mymean, scale=myscale, size=np.sum(y == i))
data = pd.DataFrame({"cont_attribute": x, "class_label": y})

# make a loss function that accounts for the bad orderings
def loss(cuts):
    x1, x2, x3 = cuts
    if 0 >= x1 or x1 >= x2 or x2 >= x3 or x3 >= 100:
        return np.inf
    yhat = pd.cut(
        data["cont_attribute"],
        bins=[0, x1, x2, x3, 100],
        labels=[0, 1, 2, 3],
        # duplicates="drop",
    ).astype("int")
    return -cohen_kappa_score(data["class_label"], yhat)


# Compute the result via brute force
ranges = [(0, 100)] * 3
Ns=30
result = brute(func=loss, ranges=ranges, Ns=Ns)
print(result)
print(-loss(result))

# Evaluate the final result in a confusion matrix
x1, x2, x3 = result
data["class_pred"] = pd.cut(
    data["cont_attribute"],
    bins=[0, x1, x2, x3, 100],
    labels=[0, 1, 2, 3],
    duplicates="drop",
).astype("int")
mat = confusion_matrix(y_true=data['class_label'],y_pred=data['class_pred'])
plt.matshow(mat)
# Loop over data dimensions and create text annotations.
for i in range(4):
    for j in range(4):
        text = plt.text(j, i, mat[i, j],
                       ha="center", va="center", color="grey")
plt.xlabel('Predicted class')
plt.ylabel('True class')
plt.show()

# Evaluate result graphically
# inspect the data
fig,ax = plt.subplots(2,1)
sns.histplot(data=data, x="cont_attribute", hue="class_label",ax=ax[0],multiple='stack')
sns.histplot(data=data, x="cont_attribute", hue="class_pred",ax=ax[1],multiple='stack')
plt.show()

Regarding the use of scipy.optimize.minimize, that is not possible when using the cohen kappa as a ojective. Since it is not differentiable, it is not so easy to optimize over. Consider using a cross entropy loss function instead. But in that case, you would need a (parametric) model for the classification task.

A standard ordinal classifier is available in the ordinal regression package in statsmodels. It will be vastly faster than the brute method, but possibly less accurate when evaluated on cohen kappa. Going that route is probably what I would have done if going for a higher number of bins.