amazon-cloudwatch huggingface-transformers amazon-sagemaker

SageMaker training: what's the correct regex pattern to capture metrics?


This is the pattern I've seen suggested in a few different posts on SO:

metric_definitions = [
    {'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'learning_rate', 'Regex': "'learning_rate': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy', 'Regex': "'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1', 'Regex': "'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision', 'Regex': "'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_recall', 'Regex': "'eval_recall': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_runtime', 'Regex': "'eval_runtime': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_samples_per_second', 'Regex': "'eval_samples_per_second': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'epoch', 'Regex': "'epoch': ([0-9]+(.|e\-)[0-9]+),?"}
]

The issue is that it fails to capture the e-0x exponent after the digits. I've tried a few variants, such as ([0-9]+(.*e\-)[0-9]+)\w+, which I tested on https://regexr.com/. It works on the website, but it still fails to capture the exponent part in CloudWatch.

I noticed the issue because my loss appeared to go up and down in CloudWatch, but when I checked the log directly I could see the loss was only going down. Every time it went from, say, 1.254e-05 to 9.365e-06, only the first portion was captured (1.254, then 9.365), so it looked like the loss had gone back up and the model was not learning.
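The truncation can be reproduced locally. The sketch below uses Python's re module as a stand-in, on the assumption that it behaves like the engine behind CloudWatch for this pattern; the log line is made up for illustration and only mimics the dictionary-style output described above.

```python
import re

# The 'loss' pattern from metric_definitions above, unchanged.
OLD_LOSS = r"'loss': ([0-9]+(.|e\-)[0-9]+),?"

# Hypothetical log line, shaped like the training output described above.
log_line = "{'loss': 9.365e-06, 'learning_rate': 1.254e-05, 'epoch': 2.0}"

old_match = re.search(OLD_LOSS, log_line)
captured = old_match.group(1)

# The unescaped "." consumes the decimal point, the digits after it are
# matched, and the pattern stops before "e-06".
print(captured)          # 9.365
print(float(captured))   # 9.365, far larger than the real value 9.365e-06
```

This matches the symptom: a true drop from 1.254e-05 to 9.365e-06 is recorded as a jump from 1.254 to 9.365.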


Solution

  • The expression you used has some issues. It matches "1234e-05" in full, but for "1.234e-05" it stops after "1.234" and drops the exponent. Also, "." has to be escaped with a backslash ("\.") to strictly match a period character; unescaped, it matches any character.

    Instead, please try (\d+(\.\d+)?(e-\d+)?)

    I only tested this with Python's regular expression module, but it should capture all of the following patterns.
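As a rough check, here is a minimal sketch that plugs the suggested expression into a metric definition list and runs it against a few sample values with Python's re module. The helper make_metric_definitions, the sample values, and the log line format are assumptions for illustration; whether CloudWatch's regex engine behaves identically has not been verified here.

```python
import re

# Suggested number pattern: digits, optional fraction, optional "e-NN" exponent.
NUMBER = r"(\d+(\.\d+)?(e-\d+)?)"

def make_metric_definitions(names):
    """Hypothetical helper: build one definition per metric name."""
    return [{"Name": n, "Regex": f"'{n}': {NUMBER}"} for n in names]

metric_definitions = make_metric_definitions(
    ["loss", "learning_rate", "eval_loss", "epoch"]
)

# Sample values (assumed): scientific notation, plain decimals, and an integer.
samples = ["1.254e-05", "9.365e-06", "1234e-05", "0.6931", "3"]

loss_regex = metric_definitions[0]["Regex"]
captured = {}
for s in samples:
    line = f"{{'loss': {s}, 'epoch': 1.0}}"  # made-up log line
    captured[s] = re.search(loss_regex, line).group(1)

for s, c in captured.items():
    print(f"{s} -> {c}")
```

Note that this pattern only covers negative lowercase exponents ("e-05"). Values written with "e+05", an uppercase "E", or a leading minus sign would need a broader expression.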