I have a histogram metric measuring the time it takes to complete a post. I'm using OpenTelemetry's .NET library and .NET's built-in Histogram class. Through this, I'm getting a metric with the `_bucket` suffix that has the `le` label on it, which contains the counts per bucket. According to what I've read, I should be able to use a Prometheus query in Grafana like the following, with a Time Series visualization, to see duration on the Y-axis, time on the X-axis, and a line showing the chosen quantile:
histogram_quantile(0.99, sum by(le) (rate(my_metric_bucket[5m])))
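For context, the instrument is created with System.Diagnostics.Metrics, roughly along these lines (a simplified sketch; the meter and metric names here are placeholders, not my exact code):

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Placeholder names -- the real meter/instrument names differ.
var meter = new Meter("MyApp.Posting");
var postDuration = meter.CreateHistogram<double>(
    name: "my_metric",
    unit: "ms",
    description: "Time taken to complete a post");

// Each post is timed and the elapsed milliseconds recorded.
var sw = Stopwatch.StartNew();
// ... perform the post ...
sw.Stop();
postDuration.Record(sw.Elapsed.TotalMilliseconds);
```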
When I do this, the Y-axis doesn't seem to match what I expect. For example, for a particular entity that I know is taking ~20 seconds to respond, the 0.99 quantile line gets pinned at 10,000 ms (if indeed the Y-axis is in ms, as expected).
My code also has a Gauge measuring the time taken (my Timer abstraction creates both a Gauge and a Histogram and records to both), so I can confirm from the Gauge that this request really was taking around 20 s.
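Conceptually, the Timer abstraction is something like the sketch below (simplified, with placeholder names; the real implementation differs):

```csharp
using System;
using System.Diagnostics.Metrics;

// Sketch of the idea: one measurement is written to both a histogram
// (for quantiles) and a gauge (for the raw last value), so the two can
// be cross-checked in Grafana. Names are placeholders.
public sealed class DurationTimer
{
    private readonly Histogram<double> _histogram;
    private double _lastMs;

    public DurationTimer(Meter meter, string name)
    {
        _histogram = meter.CreateHistogram<double>(name, unit: "ms");
        meter.CreateObservableGauge($"{name}_gauge", () => _lastMs, unit: "ms");
    }

    public void Record(TimeSpan elapsed)
    {
        _lastMs = elapsed.TotalMilliseconds;
        _histogram.Record(_lastMs);
    }
}
```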
Is there a way for me to have the Y-axis be duration, and the line show how long a given percentile of the requests takes over time?
Per the comment by @markalex, the issue was that my histogram buckets topped out at 10,000, so any observation above that landed in the implicit +Inf bucket, and `histogram_quantile` can never report a value above the largest finite bucket boundary, which is why the line was pinned at 10,000 ms. I've adjusted the buckets to cover the ranges I actually expect for this value, and now everything looks right.
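Concretely, the bucket boundaries can be overridden with an OpenTelemetry view. A sketch of the approach (the meter/instrument names and boundary values here are examples, not my exact configuration):

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Override the default bucket boundaries for the duration histogram so
// values well above 10,000 ms still land in a finite bucket.
// Placeholder names; pick boundaries that cover your expected durations.
var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyApp.Posting")
    .AddView("my_metric", new ExplicitBucketHistogramConfiguration
    {
        Boundaries = new double[] { 100, 500, 1000, 5000, 10000, 30000, 60000, 120000 }
    })
    .Build();
```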
Some good resources (also provided by @markalex) on how the `histogram_quantile` function operates:

- The Prometheus documentation on errors in quantile estimation (I was seeing an extreme case of this).
- This answer by @Ace with good detail on exactly how the `histogram_quantile` function works.