spring-boot, spring-boot-actuator, spring-micrometer

How to use percentiles-histogram without the risk of high cardinality, and with human-usable buckets?


I wrote my own advices back in Spring Boot 1 to monitor our apps, and only now am I trying to learn what newer versions of Spring Boot offer (currently on 3.3.1, but we will move to a later version) and what we can use. For example, one thing I'd like to see is what percentage of requests last longer than one second. Let's stick to measuring REST endpoints, since repositories etc. work a little differently and have different configuration. I'm using Prometheus with Grafana. If I'm doing something wrong, please advise/correct me.

So if I understand correctly, I need to enable:

management.metrics.distribution.percentiles-histogram.http.server.requests=true

which will export data over the Prometheus endpoint like:

http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="+Inf"} 1

and there will be 73 buckets like this, which seems like rather a lot for a single endpoint, and which we even need to multiply by all possible values of the exception tag. OK, so let's try to bring it down; I don't need this precision. There is a configuration for that:

management.metrics.distribution.slo.http.server.requests=100ms,500ms,1s,3s

which works, but not really, because a single call will still fill the following metrics:

http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.001"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.001048576"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.001398101"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.001747626"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.002097151"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.002446676"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.002796201"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.003145726"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.003495251"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.003844776"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.004194304"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.005592405"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.006990506"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.008388607"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.009786708"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.011184809"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.01258291"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.013981011"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.015379112"} 0
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.016777216"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.022369621"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.027962026"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.033554431"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.039146836"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.044739241"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.050331646"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.055924051"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.061516456"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.067108864"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.089478485"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.1"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.111848106"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.134217727"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.156587348"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.178956969"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.20132659"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.223696211"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.246065832"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.268435456"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.357913941"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.447392426"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.5"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.536870911"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.626349396"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.715827881"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.805306366"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.894784851"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="0.984263336"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="1.0"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="1.073741824"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="1.431655765"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="1.789569706"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="2.147483647"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="2.505397588"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="2.863311529"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="3.0"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="3.22122547"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="3.579139411"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="3.937053352"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="4.294967296"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="5.726623061"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="7.158278826"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="8.589934591"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="10.021590356"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="11.453246121"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="12.884901886"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="14.316557651"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="15.748213416"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="17.179869184"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="22.906492245"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="28.633115306"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="30.0"} 1
http_server_requests_seconds_bucket{error="none",exception="none",method="PUT",outcome="SUCCESS",status="200",uri="/my/uri",le="+Inf"} 1

So yes, the configured buckets are used (I can see them filled in the configuration properties and fetched before each call), but we still have those 73 buckets to export.

But fine, let's assume we can handle the high cardinality (can we?). How do I display the percentage of requests which last more than a second? I'm no Grafana/Prometheus pro by any means, but this could do the trick:

(
  sum by(uri) (rate(http_server_requests_seconds_bucket{le="+Inf", job="app"}[$rate_period]))
-
  sum by(uri) (rate(http_server_requests_seconds_bucket{le="1.0", job="app"}[$rate_period]))
)
/
sum by(uri) (rate(http_server_requests_seconds_bucket{le="+Inf", job="app"}[$rate_period]))

*100

The problem here lies in le="1.0". There are very few human-usable values here, I'm not sure whether these boundaries are actually stable, and I don't expect an operator to type 0.626349396 from memory during on-call. Yes, I know I can use histogram_quantile to show similar data differently, but it's not really equally readable.
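For reference, the arithmetic that the query above performs can be sketched in plain Java. This is only an illustration: the class name SlowRequestShare, the method percentSlowerThan, and the sample counts are made up for the example and are not part of Micrometer or Prometheus. The one thing it relies on is that Prometheus histogram buckets are cumulative, so the le="1.0" bucket already counts every request that finished within one second.

```java
public class SlowRequestShare {

    // countAtThreshold: value of the bucket with le="<threshold>" (cumulative).
    // countAtInf: value of the le="+Inf" bucket, i.e. all observed requests.
    // Returns the share of requests slower than the threshold, in percent.
    // With no traffic (0 / 0) the result is NaN.
    static double percentSlowerThan(double countAtThreshold, double countAtInf) {
        return 100.0 * (countAtInf - countAtThreshold) / countAtInf;
    }

    public static void main(String[] args) {
        // Hypothetical counts: 200 requests in total, 150 of them within 1s.
        System.out.println(percentSlowerThan(150, 200)); // prints 25.0
    }
}
```

The query does the same thing per uri, using rate() over the two bucket series instead of raw counts.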

Debugging Spring and Micrometer internals, I really don't see management.metrics.distribution.slo.http.server.requests actually being used. Does anyone know how this is supposed to work? Does it work?

What am I doing wrong? What are other options or workarounds?


Solution

  • I think I found it.

    It wasn't super clear to me (or from my searches, and this combination could arguably be validated to issue a warning), but:

    management.metrics.distribution.slo.http.server.requests=100ms,500ms,1s,3s

    and

    management.metrics.distribution.percentiles-histogram.http.server.requests=true

    are kind of mutually exclusive. See io.micrometer.core.instrument.distribution.DistributionStatisticConfig#getHistogramBuckets

    Specifying 'slo buckets' will actually create them as requested, BUT they will be buried under lots of default ones, which are created when percentiles-histogram is enabled. There is a list of 275 pre-created default buckets, and a subset of them is selected based on the expected minimum and maximum duration. By default (io.micrometer.core.instrument.AbstractTimerBuilder#AbstractTimerBuilder) these are 1 millisecond and 30s respectively, which you can override using org.springframework.boot.actuate.autoconfigure.metrics.MetricsProperties.Distribution#minimumExpectedValue (and the matching maximumExpectedValue).

    I don't understand this well enough, and this precision might be needed for some use cases. But if you only need to know whether something is slower than some threshold (and mostly, if something is slower than 1s, it's bad regardless of by how much), it might be safer to specify just the SLO thresholds.

    If I'm still missing something or am wrong altogether, please let me know!
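The bucket selection described in the solution can be modelled roughly as follows. This is a simplified stand-in written with the JDK only, not Micrometer's actual code: the class BucketMergeSketch and the method exportedBuckets are hypothetical, and SAMPLE_DEFAULTS holds just a handful of boundaries copied from the scrape output in the question rather than the full preset list. It assumes, as that output suggests, that the expected minimum and maximum are themselves exported as boundaries and that the SLO values are simply added on top.

```java
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class BucketMergeSketch {

    // A few default boundaries (in seconds) taken from the scrape output above.
    // The real preset list is much longer.
    static final List<Double> SAMPLE_DEFAULTS = List.of(
            0.001048576, 0.016777216, 0.089478485, 0.111848106,
            0.447392426, 0.536870911, 0.984263336, 1.073741824,
            2.863311529, 3.22122547, 28.633115306);

    // Returns the boundaries that would end up exported in this simplified model.
    static NavigableSet<Double> exportedBuckets(boolean percentilesHistogram,
                                                double minExpected,
                                                double maxExpected,
                                                List<Double> slo) {
        NavigableSet<Double> buckets = new TreeSet<>();
        if (percentilesHistogram) {
            // Keep only the preset boundaries inside the expected range...
            for (double b : SAMPLE_DEFAULTS) {
                if (b >= minExpected && b <= maxExpected) {
                    buckets.add(b);
                }
            }
            // ...and export the range limits as boundaries too.
            buckets.add(minExpected);
            buckets.add(maxExpected);
        }
        // SLO boundaries are always added, whether or not the histogram is on.
        buckets.addAll(slo);
        return buckets;
    }

    public static void main(String[] args) {
        List<Double> slo = List.of(0.1, 0.5, 1.0, 3.0);
        // Histogram on: the four SLO values are mixed in with the defaults.
        System.out.println(exportedBuckets(true, 0.001, 30.0, slo).size()); // prints 17
        // Histogram off: only the SLO values remain.
        System.out.println(exportedBuckets(false, 0.001, 30.0, slo)); // prints [0.1, 0.5, 1.0, 3.0]
    }
}
```

In this model, narrowing the expected range (for example a maximum of 5s instead of 30s) only trims the defaults, while switching the histogram off leaves exactly the human-usable thresholds.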