java grafana spring-boot-actuator spring-micrometer

Java Micrometer - What to do with metrics of type *_bucket

A quick question about metrics of type *_bucket.

My application generates metrics, like those below:


# HELP http_server_requests_seconds  
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.005592405",} 273.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.006990506",} 797.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.008388607",} 2638.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.009786708",} 3543.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.011184809",} 3932.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.01258291",} 4154.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.013981011",} 4279.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/health",le="0.015379112",} 4380.0

and

# HELP resilience4j_circuitbreaker_calls_seconds Total number of successful calls
# TYPE resilience4j_circuitbreaker_calls_seconds histogram
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001048576",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001398101",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.001747626",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.002097151",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.002446676",} 0.0
resilience4j_circuitbreaker_calls_seconds_bucket{kind="successful",name="someName",le="0.002796201",} 0.0

I believe they are really useful, but unfortunately, I do not know what to do with them.

I tried some queries such as rate(http_server_requests_seconds{_bucket_=\"+Inf\", status=~\"2..\"}[5m]), but they do not seem to bring out anything valuable.

What is the proper way to use these *_bucket metrics? For instance, how do I build Grafana dashboards and visuals that are best suited to them?

Thank you

Solution

You can use this metric to find the 99th or 95th percentile of the latency of a given endpoint with the histogram_quantile function. For example, for the 99th percentile:

histogram_quantile(
  0.99,
  sum(
    rate(http_server_requests_seconds_bucket{exception="None", uri = "/your-uri"}[5m])
  ) by (le)
)

For the 95th percentile:

histogram_quantile(
  0.95, 
  sum(
    rate(http_server_requests_seconds_bucket{exception="None", uri = "/your-uri"}[5m])
  ) by (le)
)
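
To make it less of a black box, here is a rough Java sketch of the kind of estimate histogram_quantile produces: it finds the first bucket whose cumulative count reaches the target rank and interpolates linearly inside that bucket. This is only an illustration, not Prometheus source code. The class name BucketQuantile and its fields are hypothetical, the sample values are the visible lines of the /demo listing quoted further down (the buckets hidden behind "..." are simply skipped), and it works on raw counts, whereas the queries above feed in per-second rates.

```java
/** Illustrative sketch only (hypothetical class, not Prometheus code). */
final class BucketQuantile {

    // Visible "le" upper bounds from the /demo listing; elided buckets are skipped.
    static final double[] DEMO_LE = {
        0.067108864, 0.089478485, 0.111848106, 0.134217727,
        0.156587348, 0.984263336, 1.0, Double.POSITIVE_INFINITY
    };

    // Matching cumulative counts from the same listing.
    static final double[] DEMO_COUNTS = {
        0, 0, 92382, 99050, 99703, 99987, 99987, 100000
    };

    /**
     * Estimates the q-quantile from cumulative buckets.
     *
     * @param q           quantile in [0, 1]
     * @param upperBounds bucket upper bounds, ascending, last one +Infinity
     * @param cumulative  cumulative counts in the same order
     */
    static double quantile(double q, double[] upperBounds, double[] cumulative) {
        int n = upperBounds.length;
        if (n < 2 || cumulative[n - 1] == 0) {
            return Double.NaN; // not enough data to estimate anything
        }
        double rank = q * cumulative[n - 1];

        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < n - 1 && cumulative[i] < rank) {
            i++;
        }

        // Rank lands in the +Inf bucket: fall back to the highest finite bound.
        if (i == n - 1) {
            return upperBounds[n - 2];
        }

        // The lowest bucket is assumed to start at 0.
        double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
        double below = (i == 0) ? 0.0 : cumulative[i - 1];
        double inBucket = cumulative[i] - below;
        if (inBucket == 0) {
            return upperBounds[i]; // avoids 0/0 when q is 0 and the first bucket is empty
        }

        // Linear interpolation inside the selected bucket.
        return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
    }
}
```

Calling BucketQuantile.quantile(0.95, BucketQuantile.DEMO_LE, BucketQuantile.DEMO_COUNTS) gives a value between the 0.111848106 and 0.134217727 bounds, because that is the bucket where the 95,000th request falls. The estimate can never be more precise than the bucket width, which is why finer buckets give better percentiles.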

For more background, here is a useful excerpt from https://idanlupinsky.com/blog/application-monitoring-with-micrometer-prometheus-grafana-and-cloudwatch/:

The histogram is a collection of buckets (or counters), each maintaining the number of events observed that took up to duration specified by the le tag. Let's have a look at a part of the histogram as published by our demo application:

http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.067108864",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.089478485",} 0.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.111848106",} 92382.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.134217727",} 99050.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.156587348",} 99703.0
...
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="0.984263336",} 99987.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="1.0",} 99987.0
http_server_requests_seconds_bucket{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/demo",le="+Inf",} 100000.0

The second line in the listing above indicates there were no requests observed that took up to ~89ms (specified by the le tag). This is expected given the 100ms sleep time when processing requests. Line #3 shows that 92,382 requests were observed whose duration took up to ~111ms. Note that the histogram is cumulative and that the entire count of requests falls in the last bucket with no upper limit le="+Inf".
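
Because the buckets are cumulative, the number of requests that landed in one particular bucket is the difference between its count and the previous bucket's count. A small Java sketch of that arithmetic follows; the class and method names are made up for illustration, and this is not Micrometer or Prometheus API.

```java
/** Illustrative sketch only (hypothetical helper, not a Micrometer or Prometheus API). */
final class BucketCounts {

    /** Converts cumulative bucket counts into the count that landed in each bucket. */
    static long[] perBucket(long[] cumulative) {
        long[] out = new long[cumulative.length];
        long previous = 0;
        for (int i = 0; i < cumulative.length; i++) {
            out[i] = cumulative[i] - previous; // only what this bucket added
            previous = cumulative[i];
        }
        return out;
    }

    /** Share of all observations at or below a bucket, given the le="+Inf" total. */
    static double shareAtOrBelow(long cumulativeCount, long totalCount) {
        return totalCount == 0 ? Double.NaN : (double) cumulativeCount / totalCount;
    }
}
```

For the first five lines of the listing above, perBucket(new long[] {0, 0, 92382, 99050, 99703}) yields {0, 0, 92382, 6668, 653}, and shareAtOrBelow(92382, 100000) shows that about 92.4% of requests finished within ~111ms. The same ratio of one bucket to the +Inf bucket is what you would chart to track the share of requests under a latency threshold.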