I'm working on a text classifier built on Azure AI Foundry. My colleagues have already set up a Review Evaluation which calculates an accuracy score using String Check:

Check if {{sample.output_text}} Contains {{item.humanlabel}}
Accuracy is a good start, but we'd like to get Precision and Recall (and then calculate F1 from those). The Azure docs suggest the platform can do this, but don't seem to tell you how. Is there an existing (or relatively easy) method to get these metrics?
Yes, you can calculate precision, recall, and F1 score when evaluating a custom text classification model in Azure AI Foundry, even though the default "Review Evaluation" setup might only show accuracy initially. Here’s how you can approach it:
Azure AI Foundry’s Review Evaluation calculates an accuracy score by checking whether the model’s output contains the human label (using {{sample.output_text}} Contains {{item.humanlabel}}). However, precision, recall, and F1 require explicit counts of True Positives, False Positives, False Negatives, and True Negatives.
The Azure documentation on Custom Text Classification Evaluation Metrics explains these metrics:
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 = 2 * (Precision * Recall) / (Precision + Recall)

where:

- TP = True Positives
- FP = False Positives
- FN = False Negatives
- TN = True Negatives (not used in F1, but useful for completeness)
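As a quick sanity check, here is a minimal Python sketch of those three formulas (the tp/fp/fn counts in the example call are made up, not from any real evaluation):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from raw counts, guarding against division by zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example with made-up counts:
p, r, f1 = precision_recall_f1(tp=80, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```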
Refer to this MSDOC for Azure AI Custom Text Classification Evaluation Metrics.

Currently, Azure AI Foundry doesn’t provide a built-in toggle for these metrics in the UI. However, you can extract the predictions and ground-truth data from the evaluation results and calculate them yourself:

1. Export the evaluation results: in Azure AI Foundry, navigate to the Evaluation tab of your model and use the “Download” option to export the results (usually a CSV or JSON file).
2. Identify the prediction column (output_text or prediction) and the ground-truth column (humanlabel).
3. Use a Python script (or Excel, or other tools) to count true positives, false positives, and false negatives per class, then apply the formulas above (see the sketch below).
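Here is a minimal sketch of that counting step, assuming the export is a CSV named evaluation_results.csv with output_text and humanlabel columns (adjust the file name and column names to match your actual download). Note it compares labels by exact match; if your model’s output may contain extra text around the label, mirror the String Check by using `label in pred` instead:

```python
import csv
from collections import Counter

# Assumed file and column names -- adjust to match your actual export.
PRED_COL, LABEL_COL = "output_text", "humanlabel"

counts = Counter()  # keys: (class_label, "tp" | "fp" | "fn")
with open("evaluation_results.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        pred, label = row[PRED_COL].strip(), row[LABEL_COL].strip()
        if pred == label:
            counts[(label, "tp")] += 1
        else:
            counts[(pred, "fp")] += 1   # the predicted class gains a false positive
            counts[(label, "fn")] += 1  # the true class gains a false negative

# Per-class precision, recall, and F1 from the counts.
for lbl in sorted({lbl for (lbl, _) in counts}):
    tp, fp, fn = counts[(lbl, "tp")], counts[(lbl, "fp")], counts[(lbl, "fn")]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    print(f"{lbl}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```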
Alternatively, scikit-learn can compute all of these in one call; refer to this doc for the scikit-learn metrics API.
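For example (the y_true / y_pred lists here are purely illustrative; in practice you’d build them from the humanlabel and output_text columns of your export):

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Illustrative labels only -- replace with your exported ground truth and predictions.
y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

# Macro-averaged precision, recall, and F1 across classes.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"macro precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Or get a per-class breakdown in one call:
print(classification_report(y_true, y_pred, zero_division=0))
```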