I have an AWS CloudWatch alarm with this configuration:
Type: Metric alarm
State: OK
Threshold: METRIC_NAME <= 0 for 1 datapoints within 1 day
Last change: 2022-04-14 23:30:54
Actions: Actions enabled
Metric name: METRIC_NAME
Statistic: Average
Period: 1 day
Datapoints to alarm: 1 out of 1
Missing data treatment: Treat missing data as bad (breaching threshold)
Percentiles with low samples: evaluate
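For reference, here is roughly how an equivalent alarm could be created with boto3. The alarm name and namespace are placeholders rather than our real values, and the period is expressed in seconds (86400 = 1 day):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Recreate the alarm described above: Average of METRIC_NAME over a
    # 1-day period, alarm when <= 0 for 1 of 1 datapoints, and treat
    # missing data as breaching.
    cloudwatch.put_metric_alarm(
        AlarmName="metric-name-low",          # placeholder name
        Namespace="MyApplication",            # placeholder namespace
        MetricName="METRIC_NAME",
        Statistic="Average",
        Period=86400,                         # 1 day, in seconds
        EvaluationPeriods=1,
        DatapointsToAlarm=1,
        Threshold=0,
        ComparisonOperator="LessThanOrEqualToThreshold",
        TreatMissingData="breaching",
        ActionsEnabled=True,
    )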
For the past few days, data has been missing for this metric, so the metric graph shows a gap with no datapoints.
My understanding is that, given the above configuration and the missing data for the past 3 days, this alarm should have triggered. Yet it has not. Based on the AWS docs (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html), an alarm with a 1-day period, 1 out of 1 datapoints to alarm, and missing data treated as breaching should change the alarm state from OK to ALARM. Am I missing a key component here? Thanks!
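To double-check that no state transition actually occurred, the alarm's state-change history can be inspected. A rough boto3 sketch, with a placeholder alarm name:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # List recent state transitions for the alarm to confirm that no
    # OK -> ALARM change was recorded during the days with missing data.
    history = cloudwatch.describe_alarm_history(
        AlarmName="metric-name-low",     # placeholder name
        HistoryItemType="StateUpdate",
        MaxRecords=20,
    )
    for item in history["AlarmHistoryItems"]:
        print(item["Timestamp"], item["HistorySummary"])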
Rereading the docs (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html), specifically the section titled "How alarm state is evaluated when data is missing", I believe I have figured out the issue.
CloudWatch uses what seems to be an unconfigurable "evaluation range" when deciding whether to alarm on missing data. It takes something like three or four consecutive periods of missing data before the alarm transitions to the ALARM state. Given that our period is 1 day, that means we would not be notified of missing data until the third or fourth day after the anomaly, which is not explained in the alarm configuration.
To remedy this, we changed our alarms to use the metric math FILL function, which replaces missing data points in a period with a specified value. In my case, I filled missing data points for my metric with a breaching value of 0.
Example, where m1 is the metric we were originally tracking with the alarm:
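The expression we alarm on is simply:

    FILL(m1, 0)

As a rough sketch of how such a metric-math alarm can be defined with boto3 (the alarm name and namespace below are placeholders, not our real values):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm on FILL(m1, 0) instead of the raw metric, so a period with no
    # datapoints is evaluated as 0, which breaches the <= 0 threshold.
    cloudwatch.put_metric_alarm(
        AlarmName="metric-name-low",                       # placeholder name
        ComparisonOperator="LessThanOrEqualToThreshold",
        EvaluationPeriods=1,
        DatapointsToAlarm=1,
        Threshold=0,
        TreatMissingData="breaching",
        Metrics=[
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "MyApplication",      # placeholder namespace
                        "MetricName": "METRIC_NAME",
                    },
                    "Period": 86400,                       # 1 day, in seconds
                    "Stat": "Average",
                },
                "ReturnData": False,
            },
            {
                "Id": "e1",
                "Expression": "FILL(m1, 0)",
                "Label": "METRIC_NAME (missing filled with 0)",
                "ReturnData": True,                        # the alarm evaluates this expression
            },
        ],
    )

With the FILL expression, a 1-day period with no data for m1 evaluates to 0, so the existing <= 0 threshold breaches on the next evaluation instead of waiting several days for the evaluation range to fill with missing periods.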