Our server has many users, and each user can choose one of the third-party services we support to communicate with.
Each third-party service has a different number of users communicating with it through our system, and those numbers are growing.
We want an alert whenever any of these services is down (in practice, that means monitoring for HTTP 500 responses).
We send a metric from a central networking point in our code whenever a 500 occurs, and it includes the URL of the service as a tag.
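To make the setup concrete, here is a minimal sketch of emitting such a counter by hand in the DogStatsD datagram format (`name:value|c|#tag:value`) over UDP. The metric name `myapp.thirdparty.http_500`, the tag key `service_url`, the choice to tag with only the host part of the URL, and the agent address `127.0.0.1:8125` are all assumptions for illustration, not details from our actual code.

```python
import socket
from urllib.parse import urlsplit


def build_datagram(metric: str, service_url: str) -> bytes:
    """Build a DogStatsD counter datagram tagged with the service host."""
    # Assumption: tag with the host only, so the tag value stays short and stable.
    host = urlsplit(service_url).hostname or "unknown"
    return f"{metric}:1|c|#service_url:{host}".encode("utf-8")


def report_500(service_url: str, agent=("127.0.0.1", 8125)) -> bytes:
    """Send one count to a local DogStatsD agent over UDP (fire and forget)."""
    # Hypothetical metric name, used only for this sketch.
    payload = build_datagram("myapp.thirdparty.http_500", service_url)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, agent)
    return payload
```

Calling `report_500("https://api.example.com/v1/send")` from the central networking point would then produce one data point per 500, grouped by the `service_url` tag.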
A couple of constraints:
We'd prefer a single monitor that catches everything and reports each service individually (so if services A and B are both down, we get two alerts). We don't want to create a separate monitor per service (and perhaps combine them in a composite monitor), because the number of services we communicate with is likely to grow.
We don't want to set an explicit threshold on the number of 500s above which the single monitor alerts, because each service has a different user population. For example, 10 occurrences of a 500 in 10 minutes for Service C (100k users) shouldn't count as the service being down, whereas the same number for Service B (5k users) might.
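The arithmetic behind that concern can be sketched as follows. The populations are the ones mentioned above; the "errors per 1,000 users" normalization and the function name are just one illustrative way to compare the services, not something our monitor currently does.

```python
def errors_per_1k_users(error_count: int, population: int) -> float:
    """Scale a raw 500 count by the service's user population."""
    return 1000 * error_count / population


# Populations taken from the question; the error count is the same for both.
populations = {"Service B": 5_000, "Service C": 100_000}

for name, population in populations.items():
    print(name, errors_per_1k_users(10, population))
```

The same 10 errors come out 20 times heavier for Service B than for Service C, which is why a single absolute count is a poor fit for both.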
I thought of using Outlier or Anomaly monitors, but we're trying to figure out the best configuration to avoid false positives. Switching the Outlier algorithm between DBSCAN and MAD sometimes yields nothing, and changing the tolerance yields false positives.
This is with DBSCAN, tolerance 3.0: the big spike is not detected.
Tolerances down to 1.0 detect nothing, but 0.5 detects everything, which may include false positives.
The MAD algorithm behaves the same way: there's no specific tolerance that catches only the correct values.
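To show why the tolerance is so touchy, here is a simplified sketch of a MAD-style check: a value is flagged when its distance from the median exceeds `tolerance` times the median absolute deviation. This is not Datadog's actual implementation (which also works over time windows and a percentage of points), and the per-service counts are made-up numbers.

```python
from statistics import median


def mad_outliers(values: dict[str, float], tolerance: float) -> list[str]:
    """Return the keys whose value lies more than tolerance * MAD from the median."""
    center = median(values.values())
    # Median absolute deviation: the median of the distances from the median.
    mad = median(abs(v - center) for v in values.values())
    return sorted(k for k, v in values.items() if abs(v - center) > tolerance * mad)


# Hypothetical 500 counts per service in one window.
counts = {"A": 4, "B": 5, "C": 6, "D": 7, "E": 40}

print(mad_outliers(counts, tolerance=3.0))  # ['E']
print(mad_outliers(counts, tolerance=0.5))  # ['A', 'B', 'D', 'E']
```

With these numbers a tolerance of 3.0 isolates the one real spike, while 0.5 flags almost every service. How the flagged set changes depends entirely on how the values are spread, so a tolerance tuned on one day's data may not hold on another.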
Any recommendations regarding the configuration above are welcome, as are suggestions for a different kind of monitor.
Use a Multi Alert monitor to alert for each service that meets the threshold. A Multi Alert monitor triggers individual notifications for each entity in a monitor that meets the alert threshold.
For example, when setting up a monitor to notify you if the P99 latency, aggregated by service, exceeds a certain threshold, you would receive a separate alert for each individual service whose P99 latency exceeded the alert threshold.
https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#alert-grouping
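As a rough sketch of what such a monitor definition could look like, the snippet below builds a payload in the shape accepted by Datadog's monitor API, where the `by {tag}` clause in the query is what makes it a Multi Alert. The metric name, tag key, window, and threshold are placeholders I made up; note that this still relies on a fixed threshold, so it addresses the "one monitor, one alert per service" constraint but not the differing population sizes.

```python
import json


def multi_alert_monitor(metric: str, group_tag: str, window: str, threshold: int) -> dict:
    """Build a monitor payload that alerts separately for each value of group_tag."""
    # Grouping with "by {tag}" yields one alert per tag value (Multi Alert).
    query = (
        "sum(" + window + "):sum:" + metric
        + "{*} by {" + group_tag + "}.as_count() > " + str(threshold)
    )
    return {
        "name": "HTTP 500s from {{" + group_tag + ".name}}",
        "type": "query alert",
        "query": query,
        "message": "{{" + group_tag + ".name}} is returning 500s.",
        "options": {"thresholds": {"critical": threshold}},
    }


# Placeholder values for illustration only.
monitor = multi_alert_monitor("myapp.thirdparty.http_500", "service_url", "last_10m", 50)
print(json.dumps(monitor, indent=2))
```

The `{{service_url.name}}` template variable in the name and message is filled in with the group that triggered, so each notification names the affected service.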