sre

Understand the thinking behind "slow error is even worse than a fast error"


While reading about the four golden signals of SRE (under the Latency section) at https://sre.google/sre-book/monitoring-distributed-systems/

I am specifically unable to understand the line below:

On the other hand, a slow error is even worse than a fast error!

What does it mean? Could you provide an easy-to-understand example, please?

[Research] While reading the book, I tried to understand the context, but I couldn't grasp or visualise it correctly. I searched the internet as thoroughly as I could, but I am sure I am missing the right keywords, so I finally decided to ask on Stack Overflow.


Solution

  • Here's the whole paragraph for context:

    Latency

    The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.

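    I would interpret the sentence you pointed out as follows: if a request is going to fail anyway, it is better for it to fail quickly. A slow error makes the caller wait, and keeps connections and threads busy on both sides, only to end in the same failure.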

    In short, the author suggests the following two metrics for latency: the latency of successful requests and the latency of failed requests, each tracked separately.

    Hope this helps.
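
To make the point from the quoted paragraph concrete, here is a minimal sketch in Python. The sample data and the helper name `summarize_latency` are hypothetical (they are not from the SRE book), and treating any status of 500 or above as an error is an assumption made only for illustration. It averages latency separately for successful and failed requests and compares that with a single blended average.

```python
from collections import defaultdict
from statistics import mean


def summarize_latency(requests):
    """Average latency (ms) per outcome, keeping errors separate from successes.

    `requests` is an iterable of (http_status, latency_ms) pairs.
    Assumption for this sketch: any status >= 500 counts as an error.
    """
    by_outcome = defaultdict(list)
    for status, latency_ms in requests:
        outcome = "error" if status >= 500 else "success"
        by_outcome[outcome].append(latency_ms)
    return {outcome: mean(values) for outcome, values in by_outcome.items()}


# Hypothetical sample: two normal requests, one fast 500 (backend connection
# refused immediately) and one slow 500 (the request hung until a 30 s timeout).
sample = [(200, 120), (200, 80), (500, 5), (500, 30000)]

summary = summarize_latency(sample)
overall = mean(latency for _, latency in sample)

print("success latency (ms):", summary["success"])
print("error latency (ms):  ", summary["error"])
print("blended latency (ms):", overall)
```

With this sample, the blended average says little about what users see, and dropping errors altogether would hide the request that made a user wait 30 seconds before failing. Tracking error latency on its own surfaces that slow error, which is the case the book calls worse than a fast one.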