pytorch · huggingface-transformers · amazon-sagemaker · huggingface · amazon-sagemaker-studio

Why aren't my metrics showing in SageMaker (CloudWatch)?


I'm training an S-BERT model in SageMaker using the Hugging Face library. I've followed the HF tutorials on how to define the metrics to be tracked in the huggingface_estimator, yet when the model is done training I cannot see any metrics, either in CloudWatch or by fetching the latest training job results:

from sagemaker.analytics import TrainingJobAnalytics
df = TrainingJobAnalytics(training_job_name=huggingface_estimator.latest_training_job.name).dataframe()

returns:

Warning: No metrics called loss found
Warning: No metrics called learning_rate found
Warning: No metrics called eval_loss found
Warning: No metrics called eval_accuracy found
Warning: No metrics called eval_f1 found
Warning: No metrics called eval_precision found
Warning: No metrics called eval_recall found
Warning: No metrics called eval_runtime found
Warning: No metrics called eval_samples_per_second found
Warning: No metrics called epoch found

Here's the code:

from sagemaker.huggingface import HuggingFace
from sagemaker import get_execution_role

from sagemaker import image_uris

role = get_execution_role() 

source_dir = 's3://...'
output_path = 's3://...'

metric_definitions = [{'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e\-)[0-9]+),?"},
                      {'Name': 'learning_rate', 'Regex': "'learning_rate': ([0-9]+(.|e\-)[0-9]+),?"},
                      {'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
                      {'Name': 'eval_accuracy', 'Regex': "'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
                      {'Name': 'eval_f1', 'Regex': "'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
                      {'Name': 'eval_precision', 'Regex': "'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"},
                      {'Name': 'eval_recall', 'Regex': "'eval_recall': ([0-9]+(.|e\-)[0-9]+),?"},
                      {'Name': 'eval_runtime', 'Regex': "'eval_runtime': ([0-9]+(.|e\-)[0-9]+),?"},
                      {'Name': 'eval_samples_per_second', 'Regex': "'eval_samples_per_second': ([0-9]+(.|e\-)[0-9]+),?"},
                      {'Name': 'epoch', 'Regex': "'epoch': ([0-9]+(.|e\-)[0-9]+),?"}]

estimator_image = image_uris.retrieve(framework='pytorch',region='eu-west-1',version='1.13.1',py_version='py39',image_scope='training', instance_type='ml.p3.2xlarge')


huggingface_estimator = HuggingFace(
                            entry_point='script.py',
                            dependencies=['requirements.txt', 'model.py'],
                            instance_type='ml.p3.2xlarge',
                            base_job_name='...',
                            output_path=output_path,
                            role=role,
                            instance_count=1,
                            pytorch_version=None,
                            py_version=None,
                            metric_definitions = metric_definitions,
                            image_uri=estimator_image,
                            hyperparameters = {
                                'epochs': 1,
                                'train_batch_size': 64,
                                'eval_batch_size':64,
                                'learning_rate': 2e-5,
                                'model_name':'distilbert-base-uncased'})

huggingface_estimator.fit({'train': 's3://...',
                           'test': 's3://...'})

Solution

  • It turns out the solution was available in this post:

    If using custom algorithms, an easy way to add metrics is to print / log them in the script:

    print('METRIC train_accuracy: {}'.format(accuracy))
    

    And in the metric_definitions you have to use the exact same naming in the Regex:

    {'Name': 'train_accuracy', 'Regex': "METRIC train_accuracy: ([0-9]+(.|e\-)[0-9]+)"}
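
    For reference, here's a minimal sketch of how the two sides could be paired (the metric names and values below are placeholders, not taken from the original job):

    # In the training script (e.g. script.py): print each metric in a fixed,
    # regex-friendly format so SageMaker can scrape it from the CloudWatch logs.
    eval_metrics = {'eval_accuracy': 0.91, 'eval_f1': 0.88}  # placeholder values
    for name, value in eval_metrics.items():
        print('METRIC {}: {}'.format(name, value))

    # In the notebook: each Regex must match the printed line exactly and
    # capture the numeric value in its first group.
    metric_definitions = [
        {'Name': 'eval_accuracy', 'Regex': 'METRIC eval_accuracy: ([0-9.e-]+)'},
        {'Name': 'eval_f1', 'Regex': 'METRIC eval_f1: ([0-9.e-]+)'},
    ]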