I created a script to test and get the CPU usage in millicores (e.g. 50m = 50/1000 of a CPU core) using rate(), and then to get the average of those values to compare the two results:
window_sec=120

# Range query: rate() evaluated at every 15s step over the window
curl -s -G "http://master.com:30355/api/v1/query_range" \
--data-urlencode "query=rate(container_cpu_usage_seconds_total{namespace='arka-ingress', \
pod='pod', container=''}[${window_sec}s]) * 1000" \
--data-urlencode "start=$(($(date +%s) - window_sec))" \
--data-urlencode "end=$(date +%s)" \
--data-urlencode "step=15s" | jq

# Instant query: avg() of the same rate(), evaluated at a single timestamp
curl -s -G "http://master.com:30355/api/v1/query" \
--data-urlencode "query=avg(rate(container_cpu_usage_seconds_total{namespace='arka-ingress', \
pod='pod', container=''}[${window_sec}s])) * 1000" | jq
Here is the output:
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "beta_kubernetes_io_arch": "amd64",
          "beta_kubernetes_io_os": "linux",
          "cpu": "total",
          "feature_node_kubernetes_io_network_sriov_capable": "true",
          "id": "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0c5de2b8_2dea_4dd6_9a5c_bef29bc01c35.slice",
          "instance": "master.com",
          "job": "kubernetes-nodes-cadvisor",
          "kubernetes_io_arch": "amd64",
          "kubernetes_io_hostname": "master.com",
          "kubernetes_io_os": "linux",
          "namespace": "ingress",
          "node_openshift_io_os_id": "rhcos",
          "pod": "pod"
        },
        "values": [
          [1685838435, "3.0482985848760746"],
          [1685838450, "3.0482985848760746"],
          [1685838510, "3.604911126207486"]
        ]
      }
    ]
  }
}
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": []
  }
}
The data in the second response should be the average of the values in the first result, but it's empty. When I increase the window (window_sec), I do get a response, but it isn't the average and the value is incorrect. I know the first query with rate() is correct; did I write the second query incorrectly?
The avg() function returns the average value across multiple time series at the specified timestamp, which is passed via the time query arg to /api/v1/query and defaults to the current time when omitted. This is a so-called aggregation function: it aggregates across series at a single instant, not across the values of one series over time. That's why it returns "unexpected" results.
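For illustration, here is a minimal sketch against the same endpoint as your script (dropping the pod and container matchers is just an assumption here, so that multiple series match):

# Instant query: avg() averages across all series matching the selector
# at this single evaluation timestamp; it does not average one series over time.
# The time arg is optional and defaults to "now".
curl -s -G "http://master.com:30355/api/v1/query" \
--data-urlencode "query=avg(rate(container_cpu_usage_seconds_total{namespace='arka-ingress'}[2m])) * 1000" \
--data-urlencode "time=$(date +%s)" | jq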
The rate() function returns the average per-second increase rate for every selected counter over the specified lookbehind window in square brackets. For example, rate(container_cpu_usage_seconds_total[1h]) returns the average CPU usage per container over the last hour.
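As a toy illustration with hypothetical numbers:

# If a counter grew from 3600.0 to 3630.0 seconds of cumulative CPU time
# over the last hour, then
#   rate(container_cpu_usage_seconds_total[1h]) = (3630.0 - 3600.0) / 3600 ≈ 0.00833
# i.e. about 8.33 millicores once multiplied by 1000, as in the script above.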
PromQL provides the min_over_time and max_over_time functions, which can be used for calculating the minimum and maximum values for each selected time series over the given lookbehind window in square brackets. For example, the following query returns the minimum across the last 12 average 5-minute CPU usage calculations over the last hour:
min_over_time(rate(container_cpu_usage_seconds_total[5m])[1h:5m])
This query uses the subquery feature, which isn't trivial to understand and use properly.
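The same pattern works for averages via avg_over_time, which is likely what you want here. A sketch of your second query rewritten this way (same placeholders and window as your script; untested against your setup):

window_sec=120
# Average of the per-step rate() values over the last window, computed
# server-side via a subquery with 15s resolution
curl -s -G "http://master.com:30355/api/v1/query" \
--data-urlencode "query=avg_over_time(rate(container_cpu_usage_seconds_total{namespace='arka-ingress', \
pod='pod', container=''}[${window_sec}s])[${window_sec}s:15s]) * 1000" | jq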
P.S. There is an alternative Prometheus-like monitoring system I work on, which provides the rollup_rate() function; it returns the min, avg and max per-second increase rates over the given lookbehind window. See this article for details.