Tags: prometheus, promql, percentile

How to get highest and average 95th percentile over a period of time in Prometheus Query Language


I need to get the highest and average 95th percentile recorded over time in PromQL.

This query only gets the current 95th percentile.

histogram_quantile(
  0.95, 
  sum(
    rate(
      istio_request_duration_milliseconds_bucket{destination_workload=~"service", reporter="source"}[1m]
    )
  ) by (le, destination_workload)
)

Result: (current value)

Time    destination_workload    Value
...     service                 488

Because histogram_quantile outputs an instant vector, I cannot apply max_over_time() to it.

I tried this, but I get values less than 1 for some reason, and one value for each le bucket.

quantile_over_time(
  0.95, 
  sum(
    rate(
      istio_request_duration_milliseconds_bucket{destination_workload=~"service", reporter="source"}[1m]
    )
  ) by (le, destination_workload)[1m:]
)

Result: I don't know why the values are so tiny, or why it doesn't handle the buckets.

Time    destination_workload    le         Value
...     service                 0.5        0
...     service                 1          0
...     service                 5          0
...     service                 10         0
...
...     service                 600000     0.167

We do not use Victoria Metrics.

How can I get the maximum and average 95th percentile recorded over a period of time?


Solution

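  • Averaging percentiles usually doesn't make sense, because the mean of several 95th percentiles is generally not the 95th percentile of the combined data. Here is a great article on this.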

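    To get the max you could use quantile 1, which returns the upper bound of the highest bucket that received observations (an estimate of the slowest request, not the maximum of the 95th percentile):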

    histogram_quantile(
      1, 
      sum(
        rate(
          istio_request_duration_milliseconds_bucket{destination_workload=~"service", reporter="source"}[1m]
        )
      ) by (le, destination_workload)
    )
    

    To get the average request duration you could use the following (a Histogram provides _sum and _count series as well as _bucket):

    rate(istio_request_duration_milliseconds_sum{destination_workload=~"service", reporter="source"}[1m])
    /
    rate(istio_request_duration_milliseconds_count{destination_workload=~"service", reporter="source"}[1m])
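
    The division above returns one average per label combination. If a single average per workload is wanted, matching the grouping of the bucket query, a minimal sketch (assuming destination_workload is the only label you want to keep) is to aggregate both sides before dividing:

    sum by (destination_workload) (
      rate(istio_request_duration_milliseconds_sum{destination_workload=~"service", reporter="source"}[1m])
    )
    /
    sum by (destination_workload) (
      rate(istio_request_duration_milliseconds_count{destination_workload=~"service", reporter="source"}[1m])
    )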
    
    

    Prometheus Subquery

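    If you really know what you're doing and really need the max and the average of the calculated percentile, you can use a subquery to get a range vector of calculated percentiles. The range vector in that case is the result of repeatedly evaluating the instant-vector expression (e.g. histogram_quantile) over some duration (e.g. the last 5 minutes) at some resolution (e.g. every 1 minute). The subquery has to wrap the whole histogram_quantile call: applying quantile_over_time directly to the per-bucket rates, as in the question, computes a quantile of each bucket's rate over time, which is why it returns one small value per le.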

    Average 95th percentile over the last 5 minutes with a 1-minute step:

    avg_over_time(
      histogram_quantile(
        0.95,
        sum(rate(demo_api_request_duration_seconds_bucket[5m])) by (le)
      )[5m:1m]
    )
    

    Max 95th percentile over the last 5 minutes with a 1-minute step:

    max_over_time(
      histogram_quantile(
        0.95,
        sum(rate(demo_api_request_duration_seconds_bucket[5m])) by (le)
      )[5m:1m]
    )
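
    Applied to the metric from the question, the same pattern might look like the sketch below. The 1h lookback and 1m step are assumed values, not taken from the question; adjust them to the period you care about.

    max_over_time(
      histogram_quantile(
        0.95,
        sum(
          rate(
            istio_request_duration_milliseconds_bucket{destination_workload=~"service", reporter="source"}[1m]
          )
        ) by (le, destination_workload)
      )[1h:1m]
    )

    Replacing max_over_time with avg_over_time in the same expression gives the average of the 95th percentile over that window, with the caveat about averaging percentiles mentioned at the top of this answer.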