I'm training an S-BERT model in SageMaker using the Hugging Face library. I've followed the HF tutorials on defining the metrics to be tracked in the huggingface_estimator, yet when the model finishes training I cannot see any metrics, either in CloudWatch or when fetching the latest training job results:
```python
from sagemaker.analytics import TrainingJobAnalytics

df = TrainingJobAnalytics(training_job_name=huggingface_estimator.latest_training_job.name).dataframe()
```

returns:

```
Warning: No metrics called loss found
Warning: No metrics called learning_rate found
Warning: No metrics called eval_loss found
Warning: No metrics called eval_accuracy found
Warning: No metrics called eval_f1 found
Warning: No metrics called eval_precision found
Warning: No metrics called eval_recall found
Warning: No metrics called eval_runtime found
Warning: No metrics called eval_samples_per_second found
Warning: No metrics called epoch found
```
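One quick way to confirm whether the metric definitions were attached to the training job at all is to inspect the job description (a minimal sketch using the SDK's `describe()` call; the metrics live under `AlgorithmSpecification` in the DescribeTrainingJob response):

```python
# Sketch: check which metric definitions SageMaker actually registered for the job.
desc = huggingface_estimator.latest_training_job.describe()
for md in desc['AlgorithmSpecification'].get('MetricDefinitions', []):
    print(md['Name'], '->', md['Regex'])
```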
Here's the estimator code:
```python
from sagemaker.huggingface import HuggingFace
from sagemaker import get_execution_role
from sagemaker import image_uris

role = get_execution_role()
source_dir = 's3://...'
output_path = 's3://...'

# Regexes taken from the HF SageMaker tutorial; they should match the dicts
# the Trainer logs, e.g. {'loss': 0.6931, 'learning_rate': 2e-05, ...}
metric_definitions = [
    {'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'learning_rate', 'Regex': "'learning_rate': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy', 'Regex': "'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1', 'Regex': "'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision', 'Regex': "'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_recall', 'Regex': "'eval_recall': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_runtime', 'Regex': "'eval_runtime': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_samples_per_second', 'Regex': "'eval_samples_per_second': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'epoch', 'Regex': "'epoch': ([0-9]+(.|e\-)[0-9]+),?"},
]

# Explicit training image, since pytorch_version / py_version are left as None
estimator_image = image_uris.retrieve(
    framework='pytorch',
    region='eu-west-1',
    version='1.13.1',
    py_version='py39',
    image_scope='training',
    instance_type='ml.p3.2xlarge',
)

huggingface_estimator = HuggingFace(
    entry_point='script.py',
    dependencies=['requirements.txt', 'model.py'],
    instance_type='ml.p3.2xlarge',
    base_job_name='...',
    output_path=output_path,
    role=role,
    instance_count=1,
    pytorch_version=None,
    py_version=None,
    metric_definitions=metric_definitions,
    image_uri=estimator_image,
    hyperparameters={
        'epochs': 1,
        'train_batch_size': 64,
        'eval_batch_size': 64,
        'learning_rate': 2e-5,
        'model_name': 'distilbert-base-uncased',
    },
)

huggingface_estimator.fit({'train': 's3://...',
                           'test': 's3://...'})
```
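SageMaker scrapes these metrics from the job's CloudWatch log stream using the regexes above, so one way to debug them is to run the same regexes locally against a log line copied from CloudWatch (the sample line below is only an illustration; substitute a real line from your own logs):

```python
import re

# Hypothetical Trainer-style log line; replace with a real line from CloudWatch.
sample = "{'loss': 0.6931, 'learning_rate': 2e-05, 'epoch': 0.5}"

for md in metric_definitions:
    m = re.search(md['Regex'], sample)
    print(md['Name'], '->', m.group(1) if m else 'no match')
```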
Turns out the solution was available in this post.
If you're using a custom algorithm, an easy way to add metrics is to print/log them from the training script:
```python
print('METRIC train_accuracy: {}'.format(accuracy))
```
Then, in metric_definitions, you have to use exactly the same name in the Regex:
```python
{'Name': 'train_accuracy', 'Regex': "METRIC train_accuracy: ([0-9]+(.|e\-)[0-9]+)\w+"}
```
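For context, a minimal sketch of what that looks like inside the training script (the labels, predictions, and accuracy computation are placeholders for whatever your script actually evaluates):

```python
# In the training script (script.py): print the metric in exactly the format
# the Regex expects. SageMaker scrapes the stdout/stderr stream that lands in
# CloudWatch, so the literal 'METRIC train_accuracy:' prefix must match.
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0]   # placeholder labels, for illustration only
y_pred = [0, 1, 0, 0]   # placeholder predictions, for illustration only
accuracy = accuracy_score(y_true, y_pred)
print('METRIC train_accuracy: {}'.format(accuracy))
```

Once the job emits at least one matching line, the metric shows up in CloudWatch and in TrainingJobAnalytics(...).dataframe().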