[SOLVED] Create a CloudWatch alarm for a rising trend of a metric

I have a simple metric that I send to AWS CloudWatch (CW): the number of milliseconds a method takes to execute.

I'd like to create an alarm that detects a rising trend in this metric (if I'm explaining it correctly in English).

Example: In the last 10 minutes, the metric values sent were 50ms, 70ms, 90ms, 110ms, and so on, steadily rising.

Is that even possible in CW?

I couldn't get it working because I'm really not a fan of the CW UI and couldn't find similar examples. I tried some math functions and queries, but I don't think I fully understood what the UI wants from me.

Solution

Use the RATE function to detect when a metric is rising. From the metric math documentation:

Returns the rate of change of the metric per second. This is calculated as the difference between the latest data point value and the previous data point value, divided by the time difference in seconds between the two values.

As an example, I query CloudWatch's own IncomingBytes metric (namespace AWS/Logs) in my lab account. Exactly what it represents doesn't matter. I chose the metric because in my account it both increases and decreases.

You said you don't like the UI, so here's a CLI example to start with.

The query retrieves 10 minutes of the metric at 60-second intervals and computes the rate of change.

aws cloudwatch get-metric-data \
--start-time "$(date --date "2023-09-01T11:20:00" --utc +%s)" \
--end-time "$(date --date "2023-09-01T11:30:00" --utc +%s)" \
--metric-data-queries '
[
  {
    "Id": "logsIncomingBytesSum",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/Logs",
        "MetricName": "IncomingBytes",
        "Dimensions": []
      },
      "Period": 60,
      "Stat": "Sum",
      "Unit": "Bytes"
    }
  },
  {
    "Id": "logsIncomingBytesRate",
    "Expression": "RATE(logsIncomingBytesSum)"
  }
]
'

The command returns this metric data. This is the same data that the CloudWatch UI uses to draw the chart.

The first Values list is the source metric data. The second Values list is the math expression result.

It turns out that not all minutes in the range have a data point. Minutes 20, 22, and 24 are missing. We'll see whether the RATE function takes that into account.

There is one fewer rate value than there are metric values. That's because the oldest metric value has no previous value to subtract.

{
    "MetricDataResults": [
        {
            "Id": "logsIncomingBytesSum",
            "Label": "IncomingBytes",
            "Timestamps": [
                "2023-09-01T11:29:00+00:00",
                "2023-09-01T11:28:00+00:00",
                "2023-09-01T11:27:00+00:00",
                "2023-09-01T11:26:00+00:00",
                "2023-09-01T11:25:00+00:00",
                "2023-09-01T11:23:00+00:00",
                "2023-09-01T11:21:00+00:00"
            ],
            "Values": [
                99580.0,
                8746.0,
                2753.0,
                4790.0,
                4724.0,
                3801.0,
                8080.0
            ],
            "StatusCode": "Complete"
        },
        {
            "Id": "logsIncomingBytesRate",
            "Label": "logsIncomingBytesRate",
            "Timestamps": [
                "2023-09-01T11:29:00+00:00",
                "2023-09-01T11:28:00+00:00",
                "2023-09-01T11:27:00+00:00",
                "2023-09-01T11:26:00+00:00",
                "2023-09-01T11:25:00+00:00",
                "2023-09-01T11:23:00+00:00"
            ],
            "Values": [
                1513.9,
                99.88333333333334,
                -33.95,
                1.1,
                7.691666666666666,
                -35.65833333333333
            ],
            "StatusCode": "Complete"
        }
    ],
    "Messages": []
}

Use a calculator to check the expression values.

bc -l <<< "
scale=2;
(99580.0 - 8746.0) / 60;
(8746.0 - 2753.0) / 60;
(2753.0 - 4790.0) / 60;
(4790.0 - 4724.0) / 60;
(4724.0 - 3801.0) / 120;
(3801.0 - 8080.0) / 120;
"

Allowing for differences in precision, the check matches the expression result.

1513.90
99.88
-33.95
1.10
7.69
-35.65

So in my example the metric is rising at minutes 25, 26, 28, and 29, and falling at minutes 23 and 27.

Assuming your data is sampled every minute, I'd expect the RATE of the average execution time in your example to look like this: 0.33, 0.33, 0.33 (a 20 ms increase divided by 60 seconds).
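
That expectation can be checked locally without CloudWatch. The sketch below only approximates RATE as the quoted documentation describes it: it reads "epoch_seconds value" pairs, oldest first, and divides each change in value by the change in time. The `rate` function name and the timestamps are made up for illustration; the values are the ones from the question.

```shell
#!/usr/bin/env bash
# Approximate CloudWatch's RATE(): (value - previous value) / (seconds between them).
# Input on stdin: lines of "<epoch_seconds> <value>", oldest first.
rate() {
  awk 'NR > 1 { printf "%.4f\n", ($2 - prev_v) / ($1 - prev_t) }
       { prev_t = $1; prev_v = $2 }'
}

# Values from the question (50, 70, 90, 110 ms), one minute apart.
# The timestamps are arbitrary placeholders.
rate <<'EOF'
0 50
60 70
120 90
180 110
EOF
```

This prints 0.3333 three times. Because the divisor is the actual time difference, a two-minute gap between data points halves the rate for the same jump in value, which is the same behavior the bc check above shows for the missing minutes.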

On top of this you can configure an alarm that triggers when the expression is positive.
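
Here is a minimal CLI sketch of such an alarm, assuming a custom metric. The namespace `MyApp`, the metric name `MethodDurationMs`, the alarm name, and the file name are hypothetical placeholders, so replace them with your own and add whatever dimensions your metric is published with. It uses the Average statistic over 60-second periods and requires three consecutive positive rates, so a single upward blip does not count as a trend.

```shell
#!/usr/bin/env bash
# Metric queries for the alarm: m1 is the raw metric, r1 is its rate of change.
# Only the expression returns data, so the alarm evaluates r1.
# "MyApp" and "MethodDurationMs" are placeholder names, not from the question.
cat > rising-trend-metrics.json <<'EOF'
[
  {
    "Id": "m1",
    "ReturnData": false,
    "MetricStat": {
      "Metric": {
        "Namespace": "MyApp",
        "MetricName": "MethodDurationMs"
      },
      "Period": 60,
      "Stat": "Average"
    }
  },
  {
    "Id": "r1",
    "Expression": "RATE(m1)",
    "Label": "MethodDurationRate",
    "ReturnData": true
  }
]
EOF

# Create (or update) the alarm: ALARM when the rate is above 0
# for 3 consecutive one-minute periods. Needs AWS credentials when called.
create_rising_trend_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "method-duration-rising" \
    --comparison-operator GreaterThanThreshold \
    --threshold 0 \
    --evaluation-periods 3 \
    --datapoints-to-alarm 3 \
    --treat-missing-data notBreaching \
    --metrics file://rising-trend-metrics.json
}
```

Running the script only writes the JSON file and defines the function; call `create_rising_trend_alarm` afterwards to actually create the alarm. The values 3 and `notBreaching` are judgment calls: raise the evaluation periods if you want a longer trend, and pick a threshold slightly above 0 if small fluctuations cause noise.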

The next example shows the same data in the CloudWatch UI. I configured the UI with the same query settings.

The first chart shows only the source metric.

The second chart shows only the expression result. The chart is limited to the range (-100, 100) so that you can see the smaller variations. The stacked area widget is nice here because the filled-in areas around zero make it easy to see when the metric is falling and when it is rising.