[SOLVED] avg_over_time of a max

I have a gauge metric badness which goes up when my service is performing poorly. There is one gauge per instance of the service and I have many instances.

I can take a max over all instances so that I can see how bad the worst instance is:

max(badness)

This graph is noisy because the identity of the worst instance, and how bad it is, changes frequently. I would like to smooth it out by applying a moving average. However, this doesn't work (I get a PromQL syntax error):

avg_over_time(max(badness)[1m])

How can I apply avg_over_time() to a time series that has already been aggregated with the max() operator?

My backend is VictoriaMetrics so I can use either MetricsQL or pure PromQL.

Solution

The query avg_over_time(max(process_resident_memory_bytes)[5m]) works without issues in VictoriaMetrics. It may fail if you use promxy in front of VictoriaMetrics, since promxy doesn't support MetricsQL; see this issue for details.

To make the query work in Prometheus and promxy as well, add a colon after 5m inside the square brackets:

avg_over_time(max(process_resident_memory_bytes)[5m:])

This is called a subquery in Prometheus. See more details about subquery specifics in VictoriaMetrics and Prometheus in this article.
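
Applied to the metric from the question, the same fix would read avg_over_time(max(badness)[1m:]). To illustrate what such a subquery computes, here is a rough Python sketch: it first takes the per-timestamp max across instances, then a trailing average over that aggregated series. The sample data, the function names and the left-open window (t - window, t] are assumptions made for illustration only; real engines also align subquery steps and handle staleness, which this sketch ignores.

```python
from statistics import mean


def max_across_instances(samples_by_instance):
    """Per-timestamp max over all instances, roughly what max(badness) returns."""
    timestamps = sorted(
        {t for series in samples_by_instance.values() for t in series}
    )
    return {
        t: max(s[t] for s in samples_by_instance.values() if t in s)
        for t in timestamps
    }


def avg_over_window(series, window):
    """Trailing average over (t - window, t] for each timestamp t.

    This mimics avg_over_time(<expr>[window:]) applied to an already
    aggregated series; the window boundary handling is an assumption.
    """
    return {
        t: mean(v for ts, v in series.items() if t - window < ts <= t)
        for t in series
    }


# Hypothetical samples: instance -> {timestamp in seconds: badness value}
badness = {
    "a": {0: 1.0, 15: 5.0, 30: 1.0, 45: 1.0},
    "b": {0: 2.0, 15: 1.0, 30: 4.0, 45: 1.0},
}

worst = max_across_instances(badness)   # {0: 2.0, 15: 5.0, 30: 4.0, 45: 1.0}
smoothed = avg_over_window(worst, 30)   # {0: 2.0, 15: 3.5, 30: 4.5, 45: 2.5}
print(smoothed)
```

Note that the average is taken over the max series, not over each instance separately, which is why the inner max() has to be evaluated at several points in time before avg_over_time() can be applied.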