For a set of subjects I have a continuous variable, cont_attribute, with range 0-100 representing a quantification of a subject's state. For each subject I also have an ordinal variable, class_label, representing a reader's annotation of the subject's state as one of four states (e.g. 1, 2, 3, 4). Values of cont_attribute overlap between classes. My goal is to discretize cont_attribute so that agreement with class_label is optimized.
When discretizing cont_attribute, arbitrary thresholds x1, x2, x3 can be applied to the continuous variable directly to yield bins for the four ordinal categories, and agreement with the reader annotation class_label can be assessed:
cohen_kappa_score(pd.cut(df['cont_attribute'], bins=[0, x1, x2, x3, 100], labels=['1', '2', '3', '4']).astype('int'),
                  df['class_label'].astype('int'))
I have found several options for discretization of a continuous variable, such as Jenks natural breaks and sklearn KMeans, though these options do not take the class into account.
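For example, an unsupervised k-means discretization can be done with sklearn's KBinsDiscretizer (a rough illustration only; the bin edges it chooses ignore class_label entirely, and the bin_kmeans column name is just for this example):
from sklearn.preprocessing import KBinsDiscretizer

# unsupervised binning into 4 ordinal categories; class_label plays no role here
discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='kmeans')
df['bin_kmeans'] = discretizer.fit_transform(df[['cont_attribute']]).astype('int') + 1  # map 0-3 to 1-4
print(discretizer.bin_edges_)  # thresholds chosen purely from cont_attribute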
What I tried:
I attempted to optimize the function above to yield the maximal value using scipy.optimize.minimize. For each threshold between two classes, I use the minimum value of the larger class and the maximum value of the smaller class as the range within which to search for the respective optimal cutoff point. With this approach I run into a problem:
ValueError: bins must increase monotonically.
def objfunc(grid):
    x1, x2, x3 = grid
    return -cohen_kappa_score(
        pd.cut(df['cont_attribute'], bins=[0, x1, x2, x3, 100],
               labels=['1', '2', '3', '4'], duplicates='drop').astype('int'),
        df['class_label'].astype('int'))

grid = (slice(df[df['class_label'] == 2]['cont_attribute'].min(), df[df['class_label'] == 1]['cont_attribute'].max(), 0.5),
        slice(df[df['class_label'] == 3]['cont_attribute'].min(), df[df['class_label'] == 2]['cont_attribute'].max(), 0.5),
        slice(df[df['class_label'] == 4]['cont_attribute'].min(), df[df['class_label'] == 3]['cont_attribute'].max(), 0.5))
solution = brute(objfunc, grid, finish=None, full_output=True)
solution
In Python, is there a straightforward way to optimize the thresholds x1, x2, x3 taking agreement with the class into account (supervised discretization)? Alternatively, how can the above function be rewritten to yield a maximum using scipy.optimize.minimize?
The error message is not too hard to address. The pandas cut method demands that the cut vector [0, x1, x2, x3, 100] is strictly monotonic. By having some mechanism in place that makes sure no invalid values are passed to cut, we are safe; that is what I implemented below. To denote an invalid setting, it is customary to return np.inf, since all other values are lower, so any minimizer will treat such an invalid setting as undesirable. See below for the implementation. I also included all the imports and some data generation, so that it is simple to run the code. Please do so in future questions as well.
You might want to use more than 10 bins per dimension in the brute force search.
Also, the code is quite inefficient. It brute-forces over all combinations of x1, x2, x3, but a lot of them are invalid (e.g. x2 <= x1), so you might want to parametrize the problem as (x1, x2-x1, x3-x2) instead and search over nonnegative values in the second and third components; a sketch of that idea follows below.
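Here is a minimal sketch of that reparametrization (the loss_widths name and the search ranges are purely illustrative); it expects the same data frame with cont_attribute and class_label columns that the full example below generates:
import numpy as np
import pandas as pd
from scipy.optimize import brute
from sklearn.metrics import cohen_kappa_score

def loss_widths(params, df):
    # params = (x1, width of bin 2, width of bin 3); positive widths keep the cuts ordered
    x1, w2, w3 = params
    x2, x3 = x1 + w2, x1 + w2 + w3
    if not (0 < x1 < x2 < x3 < 100):
        return np.inf  # still guard against degenerate or out-of-range cuts
    yhat = pd.cut(df["cont_attribute"], bins=[0, x1, x2, x3, 100],
                  labels=[0, 1, 2, 3]).astype("int")
    return -cohen_kappa_score(df["class_label"], yhat)

# usage with the `data` frame generated below (ranges are a guess):
# result = brute(loss_widths, ranges=[(1, 95), (1, 50), (1, 50)], Ns=20,
#                args=(data,), finish=None)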
Finally, the brute method is a minimizer, so you should return -cohen_kappa from the objective.
#%%
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from scipy.stats import truncnorm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.optimize import brute
#
# Generate Data
#
n = 1000
np.random.seed(0)
y = np.random.choice(4, p=[0.1, 0.3, 0.4, 0.2], size=n)
x = np.zeros(n)
for i in range(4):
    low = 0
    high = 100
    mymean = 20 * i
    myscale = 8
    a, b = (low - mymean) / myscale, (high - mymean) / myscale
    x[y == i] = truncnorm.rvs(a=a, b=b, loc=mymean, scale=myscale, size=np.sum(y == i))
data = pd.DataFrame({"cont_attribute": x, "class_label": y})
# make a loss function that accounts for the bad orderings
def loss(cuts):
    x1, x2, x3 = cuts
    if 0 >= x1 or x1 >= x2 or x2 >= x3 or x3 >= 100:
        return np.inf
    yhat = pd.cut(
        data["cont_attribute"],
        bins=[0, x1, x2, x3, 100],
        labels=[0, 1, 2, 3],
        # duplicates="drop",
    ).astype("int")
    return -cohen_kappa_score(data["class_label"], yhat)
# Compute the result via brute force
ranges = [(0, 100)] * 3
Ns=30
result = brute(func=loss, ranges=ranges, Ns=Ns)
print(result)
print(-loss(result))
# Evaluate the final result in a confusion matrix
x1, x2, x3 = result
data["class_pred"] = pd.cut(
data["cont_attribute"],
bins=[0, x1, x2, x3, 100],
labels=[0, 1, 2, 3],
duplicates="drop",
).astype("int")
mat = confusion_matrix(y_true=data['class_label'],y_pred=data['class_pred'])
plt.matshow(mat)
# Loop over data dimensions and create text annotations.
for i in range(4):
    for j in range(4):
        text = plt.text(j, i, mat[i, j],
                        ha="center", va="center", color="grey")
plt.xlabel('Predicted class')
plt.ylabel('True class')
plt.show()
# Evaluate result graphically
# inspect the data
fig,ax = plt.subplots(2,1)
sns.histplot(data=data, x="cont_attribute", hue="class_label",ax=ax[0],multiple='stack')
sns.histplot(data=data, x="cont_attribute", hue="class_pred",ax=ax[1],multiple='stack')
plt.show()
Regarding the use of scipy.optimize.minimize, that is not possible when using Cohen's kappa as the objective. Since it is not differentiable, it is not so easy to optimize over. Consider using a cross-entropy loss function instead, but in that case you would need a (parametric) model for the classification task.
A standard ordinal classifier is available in the ordinal regression package in statsmodels. It will be vastly faster than the brute method, but possibly less accurate when evaluated on Cohen's kappa. Going that route is probably what I would have done if going for a higher number of bins; a rough sketch follows.
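Here is a rough sketch of that route, assuming a recent statsmodels that ships OrderedModel (the column name class_pred_ordinal is just for illustration); it reuses the data frame from the example above:
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel
from sklearn.metrics import cohen_kappa_score

# ordinal (proportional-odds) logistic regression of class_label on cont_attribute
y_ord = data["class_label"].astype(pd.CategoricalDtype(ordered=True))
mod = OrderedModel(y_ord, data[["cont_attribute"]], distr="logit")
res = mod.fit(method="bfgs", disp=False)

# predicted class = most probable category for each subject
probs = np.asarray(res.predict(data[["cont_attribute"]]))
data["class_pred_ordinal"] = probs.argmax(axis=1)
print(cohen_kappa_score(data["class_label"], data["class_pred_ordinal"]))
Note that the fitted thresholds of such a model live on the latent (logit) scale rather than directly on cont_attribute, so they are not immediately comparable to x1, x2, x3.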