Original problem: I would like a Kubernetes cluster that always has at least 2 nodes with zero GPU consumption. When a job arrives and takes one of those nodes, the autoscaler should create another spare node.
I found that I can rely on the DCGM_FI_DEV_GPU_UTIL metric: if DCGM_FI_DEV_GPU_UTIL == 0, the node is "idle". In PromQL I can simply write count(DCGM_FI_DEV_GPU_UTIL == 0) to get the number of "idle" nodes.
However, I do not understand how to write metricsQuery in the Prometheus Adapter config. All the examples I found look like sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>). I need something like count(<<.Series>> == 0), but this does not work. Any idea how I can expose a metric to the HPA that gives the number of nodes with no GPU consumption?
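For reference, a rule of roughly the following shape is one thing that could be tried on the Prometheus Adapter side. It is an untested sketch, not a confirmed fix: the exposed name idle_gpu_nodes is hypothetical, the rule assumes the metric is served through the external metrics API without a namespace, and the "or vector(0)" fallback is added on the assumption that the HPA should see 0 rather than no data when no GPU is idle.

```yaml
# Untested sketch of a Prometheus Adapter external rule (adapter config file format).
externalRules:
- seriesQuery: 'DCGM_FI_DEV_GPU_UTIL'
  resources:
    # Assumption: the metric is not tied to any namespace.
    namespaced: false
  name:
    matches: "^DCGM_FI_DEV_GPU_UTIL$"
    as: "idle_gpu_nodes"   # hypothetical name exposed to the HPA
  # count() yields no sample when nothing matches, so fall back to 0.
  metricsQuery: 'count(<<.Series>>{<<.LabelMatchers>>} == 0) or vector(0)'
```

Note that DCGM_FI_DEV_GPU_UTIL is reported per GPU, so on nodes with several GPUs this counts idle GPUs rather than idle nodes.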
I ended up using KEDA with the prometheus trigger. It is easy to use and accepts a PromQL query directly. The only disadvantage is that it is an "average value" scaler, but that is not critical in my case.
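A minimal sketch of what such a KEDA setup might look like is shown here. The object name, the scale target gpu-workload, the Prometheus address, the replica bounds and the threshold are all placeholders chosen for illustration, not values from my cluster; only the query comes from the problem above.

```yaml
# Sketch of a KEDA ScaledObject using the prometheus trigger.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: idle-gpu-scaler            # hypothetical name
spec:
  scaleTargetRef:
    name: gpu-workload             # hypothetical Deployment to scale
  minReplicaCount: 1               # placeholder bounds
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # assumed Prometheus URL
      query: count(DCGM_FI_DEV_GPU_UTIL == 0)                 # number of idle GPUs
      threshold: "2"                                          # placeholder target value
```

Because the trigger behaves as an "average value" target, the HPA that KEDA creates compares the query result divided by the current replica count against the threshold, so the desired replica count is roughly the query result divided by the threshold, rounded up. How the target and threshold should map onto the spare-node requirement depends on the setup.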