Tags: amazon-web-services, amazon-cloudwatch, cloudwatch-alarms

CloudWatch alarm not triggering on missing data


I have an AWS CloudWatch alarm with this configuration:

Type: Metric alarm
State: OK
Threshold: METRIC_NAME <= 0 for 1 datapoints within 1 day
Last change: 2022-04-14 23:30:54
Actions: Actions enabled
Metric name: METRIC_NAME
Statistic: Average
Period: 1 day
Datapoints to alarm: 1 out of 1
Missing data treatment: Treat missing data as bad (breaching threshold)
Percentiles with low samples: evaluate

For the past few days, we have been missing data for this metric, resulting in a graph that looks like this:

[Image: metric graph showing missing data]

My understanding is that, given the above configuration and the missing data for the past 3 days, this alarm should have triggered. Yet it has not. According to the AWS docs (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html), an alarm with a 1 day period, 1 out of 1 datapoints to alarm, and missing data treated as breaching should change state from OK to ALARM. Am I missing a key component here? Thanks!


Solution

  • After rereading the AWS docs (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html), specifically the section titled "How alarm state is evaluated when data is missing", I believe I have figured out the issue.

    CloudWatch has what seems to be an unconfigurable "evaluation range" that it uses when deciding whether to alarm on missing data. The range appears to cover more data points than the configured number of evaluation periods, so in our case it took roughly three or four consecutive periods of missing data before the alarm would transition to the ALARM state. Given that our period is 1 day, we would not be notified of missing data until the third or fourth day after the data stopped arriving, and nothing in the alarm configuration makes this delay apparent.

    To remedy this issue, we changed our alarms to use the metric math FILL function, which fills missing data points in the period with a specified value. In my case, I filled missing data points for my metric with a breaching value of 0.

    Example expression, where m1 is the metric that we were originally tracking with the alarm: FILL(m1, 0)
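
    For anyone who creates alarms from code rather than the console, here is a minimal sketch of the same FILL-based alarm expressed as parameters for the CloudWatch PutMetricAlarm API (for example via boto3's put_metric_alarm). The helper name build_fill_alarm, the alarm name, and the namespace are made up for illustration, and the threshold settings simply mirror the configuration in the question. Treat it as a starting point under those assumptions, not as the exact setup we deployed.

```python
import json


def build_fill_alarm(alarm_name, namespace, metric_name, period_seconds=86400):
    """Build PutMetricAlarm parameters that alarm on FILL(m1, 0).

    The function name and all argument values are illustrative assumptions.
    """
    return {
        "AlarmName": alarm_name,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "Threshold": 0.0,
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "TreatMissingData": "breaching",
        "Metrics": [
            {
                # m1: the raw metric originally tracked by the alarm.
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                    },
                    "Period": period_seconds,
                    "Stat": "Average",
                },
                # Not evaluated directly; it only feeds the expression below.
                "ReturnData": False,
            },
            {
                # e1: m1 with missing data points replaced by 0, which
                # breaches the "<= 0" threshold.
                "Id": "e1",
                "Expression": "FILL(m1, 0)",
                "Label": "m1 with missing data filled as 0",
                # The alarm evaluates the one query that returns data.
                "ReturnData": True,
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder names; substitute your own alarm, namespace, and metric.
    params = build_fill_alarm("example-missing-data-alarm", "Example/Namespace", "METRIC_NAME")
    print(json.dumps(params, indent=2))
    # To create the alarm (requires boto3 and AWS credentials):
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
```

    Because the alarm now watches the expression e1 instead of m1, a period with no data reaches the alarm as an ordinary data point with value 0 rather than as missing data.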