z3py threshold optimization results in worse performance than the unoptimized solution

In a previous question, I asked about optimizing the decision threshold of a prediction model. The solution led me to the z3py library.

I am now trying a similar setup as before, but want to optimize the decision threshold of a binary prediction model to maximize the accuracy.

However, the optimized threshold performs worse than the default threshold, even though the optimizer is free to choose the default value itself.

My MWE is below (it uses fixed-seed random targets and probabilities so that my findings can be reproduced):

import numpy as np
from z3 import z3


def compute_eval_metrics(ground_truth, predictions):
    from sklearn.metrics import accuracy_score, f1_score

    accuracy = accuracy_score(ground_truth, predictions)
    macro_f1 = f1_score(ground_truth, predictions, average="macro")
    return accuracy, macro_f1


def optimization_acc_target(
    predictions: np.ndarray,
    ground_truth: np.ndarray,
    default_threshold=0.5,
):
    tp = np.sum((predictions > default_threshold) & (ground_truth == 1))
    tn = np.sum((predictions <= default_threshold) & (ground_truth == 0))

    initial_accuracy = (tp + tn) / len(ground_truth)
    print(f"Accuracy: {initial_accuracy:.3f}")

    _, initial_macro_f1_score = compute_eval_metrics(
        ground_truth, np.where(predictions > default_threshold, 1, 0)
    )

    n = len(ground_truth)
    iRange = range(n)

    threshold = z3.Real("threshold")

    opt = z3.Optimize()
    predictions = predictions.tolist()
    ground_truth = ground_truth.tolist()

    true_positives = z3.Sum(
        [
            z3.If(predictions[i] > threshold, 1, 0)
            for i in iRange
            if ground_truth[i] == 1
        ]
    )
    true_negatives = z3.Sum(
        [
            z3.If(predictions[i] <= threshold, 1, 0)
            for i in iRange
            if ground_truth[i] == 0
        ]
    )
    acc = z3.Sum(true_positives, true_negatives) / n

    # Add constraints
    opt.add(threshold >= 0.0)
    opt.add(threshold <= 1.0)

    # Maximize accuracy
    opt.maximize(acc)

    if opt.check() == z3.sat:
        m = opt.model()

        # as_decimal() returns a string and appends "?" when the expansion is truncated
        t = float(m[threshold].as_decimal(10).rstrip("?"))
        print(f"Optimal threshold: {t}")

        optimized_accuracy, optimized_macro_f1_score = compute_eval_metrics(
            ground_truth, np.where(np.array(predictions) > t, 1, 0)
        )

        print(f"Accuracy: {optimized_accuracy:.3f} (was: {initial_accuracy:.3f})")
        print(
            f"Macro F1 Score: {optimized_macro_f1_score:.3f} (was: {initial_macro_f1_score:.3f})"
        )
        print()

    else:
        print("Failed to optimize")


np.random.seed(42)
ground_truth = np.random.randint(0, 2, size=50)
predictions = np.random.rand(50)

optimization_acc_target(
    predictions=predictions,
    ground_truth=ground_truth,
)

In my code, I compute the accuracy from the true-positive and true-negative counts.

The output is:

Accuracy: 0.600
Optimal threshold: 0.9868869366
Accuracy: 0.480 (was: 0.600)
Macro F1 Score: 0.355 (was: 0.599)

It always returns a worse solution than the default threshold of 0.5. I am puzzled about why this is the case. Should it not perform at least as well as the default?

To solve this, I tried using z3py constructs throughout (e.g. z3.If inside the z3.Sum parts), thinking that mixing data types might lead to wrong results. This made no difference (which is good, as it aligns with an official example). I also found this GitHub issue, but it seems to concern non-linear constraints, which I am not using.

I am now wondering: what causes the results with the optimized threshold to be worse than the default threshold? I appreciate pointers to further resources and background information.

Solution

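I found the solution, and I am afraid it was simple:

In the posted question, I unintentionally used integer division. Each z3.If(..., 1, 0) term is Int-sorted, so the sum is an Int, and dividing an Int by the Python integer n is integer division in z3py. The quotient is 0 unless every sample is classified correctly, so the objective is flat and the optimizer can return any threshold in [0, 1].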

acc = z3.Sum(true_positives, true_negatives) / n

After more digging, I found another SO question here, which pointed me to the problematic line shown above.
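To illustrate why the integer division matters, here is a minimal plain-Python sketch (not z3 code). The labels, scores, threshold grid, and the helper `correct_count` are all made up for illustration; Python's `//` stands in for the integer division that z3py applies to Int-sorted terms.

```python
# Hypothetical toy data, chosen only to illustrate the effect.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.35, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9]
n = len(labels)


def correct_count(threshold):
    """Number of true positives plus true negatives at this threshold."""
    return sum(
        (score > threshold) == (label == 1)
        for score, label in zip(scores, labels)
    )


thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]

# Integer division: every threshold gets the same objective value.
floored = [correct_count(t) // n for t in thresholds]
# Real division: the objective distinguishes between thresholds.
exact = [correct_count(t) / n for t in thresholds]

print(floored)  # [0, 0, 0, 0, 0]
print(exact)    # [0.5, 0.625, 0.75, 0.5, 0.5]
```

With the floored objective, every threshold ties at 0, so a maximizer has no reason to prefer one over another.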

What worked in the end was:

# Works: convert the Int sum to a Real before dividing
# acc = z3.ToReal(true_positives + true_negatives) / n
# Alternatively, maximize only the TP + TN count (gives the same result):
acc = true_positives + true_negatives

This gives the expected output (both accuracy and macro F1 improve):

Optimal threshold: 0.3886772896
Accuracy: 0.620 (was: 0.600)
Macro F1 Score: 0.616 (was: 0.599)
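A solver result like this can be sanity-checked without z3. Accuracy is piecewise constant in the threshold and can only change at the prediction values, so trying 0.0 and every score covers all cases (assuming scores lie in [0, 1]). The sketch below is plain Python with made-up data; `accuracy_at` and `best_threshold` are hypothetical helper names, not part of z3 or scikit-learn.

```python
# Hypothetical toy data, chosen only for illustration.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.35, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9]


def accuracy_at(threshold, scores, labels):
    """Accuracy when predicting 1 for score > threshold, else 0."""
    correct = sum((s > threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)


def best_threshold(scores, labels):
    """Brute-force search over the thresholds where accuracy can change."""
    candidates = sorted({0.0, *scores})
    # max() returns the first (smallest) candidate among ties.
    return max(candidates, key=lambda t: accuracy_at(t, scores, labels))


best = best_threshold(scores, labels)
print(best, accuracy_at(best, scores, labels))  # 0.35 0.75
print(accuracy_at(0.5, scores, labels))         # 0.75
```

Because the default threshold's prediction pattern is covered by one of the candidates, the brute-force optimum can never be worse than the default. An optimizer that reports a lower accuracy than the default therefore points to a modelling error in the objective.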

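For the record, and for anyone who finds this later, the following attempts do not work:

# No (the division still happens on Ints before the conversion): acc = z3.ToReal((true_positives + true_negatives) / n)
# No (z3.Real declares a new variable from a name; it does not convert an expression): acc = z3.Real(true_positives + true_negatives) / n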