I want to meet the following SLIs (Service Level Indicators) for our HTTP endpoints, using blackbox exporter for probing: 80% availability and latency below 1s.
For latency, I figured I can use the query probe_http_duration_seconds > 1, but for availability I am not sure I am doing it correctly with quantile_over_time(0.80, probe_http_status_code)[1d] > 400. The condition greater than 400 is there to check for HTTP errors, because I assume any HTTP status code above 400 is an error. Is this correct for my case? If not, please guide me. Thanks.
If you want to calculate the ratio of successful probes to the number of all recorded probes:
count_over_time((probe_http_status_code<400)[1d:])/count_over_time(probe_http_status_code[1d:])
If you want to find the ratio of successful probes to the number of all possible probes (allowing for probes that were not executed, for example because blackbox_exporter was down):
count_over_time((probe_http_status_code<400)[1d:])/1440
where 1440 is the number of possible probes within the specified time range (1440 is the result of 1d / 1m, assuming scrape_interval is 1 minute; change it according to your setup).