scala, akka, akka-actor, kamon

Handling AskTimeoutException in an Akka-based application


I have the following HTTP-based application that routes every request to an Akka Actor which uses a long chain of Akka Actors to process the request.

   path("process-request") {
      post {
        val startedAtAsNano = System.nanoTime()
        NonFunctionalMetrics.requestsCounter.inc()
        NonFunctionalMetrics.requestsGauge.inc()
        entity(as[Request]) { request =>
          onComplete(distributor ? [Response](replyTo => Request(request, replyTo))) {
            case Success(response) =>
              NonFunctionalMetrics.requestsGauge.dec()
              NonFunctionalMetrics.responseHistogram.labels(HttpResponseStatus.OK.getCode.toString).observeAsMicroseconds(startedAtAsNano, System.nanoTime())
              complete(response)
            case Failure(ex) =>
              NonFunctionalMetrics.requestsGauge.dec()
              NonFunctionalMetrics.responseHistogram.labels(HttpResponseStatus.INTERNAL_SERVER_ERROR.getCode.toString).observeAsMicroseconds(startedAtAsNano, System.nanoTime())
              logger.warn(s"A general error occurred for request: $request, ex: ${ex.getMessage}")
              complete(InternalServerError, s"A general error occurred: ${ex.getMessage}")
          }
        }
      }
    }

As you can see, I'm sending the distributor an ask and waiting for its response.

The problem is that at really high RPS, the ask to the distributor sometimes fails with the following exception:

2022-04-16 00:36:26.498 WARN c.d.p.b.http.AkkaHttpServer - A general error occurred for request: Request(None,0,None,Some(EntitiesDataRequest(10606082,0,-1,818052,false))) with ex: Ask timed out on [Actor[akka://MyApp/user/response-aggregator-pool#1374579366]] after [5000 ms]. Message of type [com.dv.phoenix.common.pool.WorkerPool$Request]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.

This is a typically uninformative exception. The normal processing time is about 700 micros, so at 5 seconds the request must be stuck somewhere in the pipeline; the processing itself cannot be that slow.

I want to monitor this. I thought about adding the Kamon integration, which provides an Akka Actors module with mailbox metrics, etc.

I tried adding the configuration from https://kamon.io/docs/latest/instrumentation/akka/ask-pattern-timeout-warning/ but it didn't work for me (it showed no effect).

Are there any other suggestions for understanding the cause of this issue in a high-RPS system?

Thanks!


Solution

  • The Kamon instrumentation is useful for finding out how you got to the ask that timed out. That can help if you have a lot of places where an ask can time out, but otherwise it's not likely to tell you what the problem is.

    This is because an ask timeout is nearly always a symptom of some other problem (the lone exception is if many asks could plausibly be done in a stream (e.g. in a mapAsync or ask stage) but aren't; that doesn't apply in this code). Assuming that the timeouts aren't caused by (e.g.) a database being down so you're getting no reply or a cluster failing (both of these are fairly obvious, thus my assumption), the cause of a timeout (any timeout, generally) is often having too many elements in a queue ("saturation").

    But which queue? We'll start with the distributor, which is an actor processing messages one-at-a-time from its mailbox (which is a queue). When you say that the normal processing time is 700 micros, is that measuring the time the distributor spends handling a request (i.e. the time before it can handle the next request)? If so, and the distributor is taking 700 micros, but requests come in every 600 micros, this can happen:
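
    A rough sketch of that dynamic, using exactly 700 micros of processing per request and one arrival every 600 micros (the numbers are just the ones from above, assumed to be constant):

        t = 0:    request 1 arrives, starts processing immediately
        t = 600:  request 2 arrives, waits in the mailbox
        t = 700:  request 1 done, request 2 starts (waited 100 micros)
        t = 1200: request 3 arrives, waits behind request 2
        t = 1400: request 2 done, request 3 starts (waited 200 micros)
        t = 1800: request 4 arrives, waits behind request 3
        t = 2100: request 3 done, request 4 starts (waited 300 micros)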

    and so on: the latency and depth increase without bound. Eventually, the depth is such that requests spend 5 seconds (or more, even) in the mailbox.

    Kamon has the ability to track the number of messages in the mailbox of an actor (it's recommended to only do this on specific actors). Tracking the mailbox depth of the distributor in this case would show it growing without bound, confirming that this is what's happening.
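
    As a sketch of the relevant configuration (the exact keys here are an assumption based on the Kamon 2.x Akka instrumentation, and the actor path pattern is a guess from your logs, so double-check against the Kamon docs):

        kamon.instrumentation.akka.filters {
          actors.track {
            # Only track the specific actors you suspect; tracking everything is expensive.
            includes = [ "MyApp/user/response-aggregator-pool*" ]
          }
        }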

    If the distributor's mailbox is the queue that's getting too deep, first consider how request N can affect request N + 1. The one-at-a-time processing model of an actor is only strictly required when the response to a request can be affected by the request immediately prior to it. If a request only concerns some portion of the overall state of the system then that request can be handled in parallel with requests that do not concern any part of that portion. If there are distinct portions of the overall state such that no request is ever concerned with 2 or more portions, then responsibility for each portion of state can be offloaded to a specific actor and the distributor looks at each request only for long enough to determine which actor to forward the request to (note that this will typically not entail the distributor making an ask: it hands off the request and it's the responsibility of the actor it hands off to (or that actor's designee...) to reply). This is basically what Cluster Sharding does under the hood, and it's also noteworthy that doing this will probably increase the latency under low load (because you are doing more work), but it increases peak throughput by up to the number of portions of state.
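
    As a very rough Akka Typed sketch of that hand-off shape (all names here are hypothetical, and partitioning by entityId % 16 is just an assumed scheme; the HTTP route keeps asking the distributor, but the distributor itself never asks anyone):

        import akka.actor.typed.{ActorRef, Behavior}
        import akka.actor.typed.scaladsl.Behaviors

        // Hypothetical protocol: the request carries the original replyTo so that
        // whichever actor ends up handling it can answer the HTTP layer directly.
        final case class WorkRequest(entityId: Long, payload: String, replyTo: ActorRef[WorkResponse])
        final case class WorkResponse(entityId: Long, result: String)

        object Distributor {
          def apply(): Behavior[WorkRequest] =
            Behaviors.setup { context =>
              // One child per portion of the state, spawned lazily.
              var partitions = Map.empty[Long, ActorRef[WorkRequest]]

              Behaviors.receiveMessage { request =>
                val key = request.entityId % 16 // assumed partitioning scheme
                val worker = partitions.getOrElse(key, {
                  val ref = context.spawn(PartitionWorker(), s"partition-$key")
                  partitions += key -> ref
                  ref
                })
                // Hand off and move on: the partition worker, not the distributor, replies.
                worker ! request
                Behaviors.same
              }
            }
        }

        object PartitionWorker {
          def apply(): Behavior[WorkRequest] =
            Behaviors.receiveMessage { request =>
              request.replyTo ! WorkResponse(request.entityId, s"processed ${request.payload}")
              Behaviors.same
            }
        }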

    If that's not a workable way to address the distributor's mailbox being saturated (viz. there's no good way to partition the state), then you can at least limit the time requests spend in the mailbox by including a "respond-by" field in the request message (e.g. for a 5 second ask timeout, you might require a response by 4900 millis after constructing the ask). When the distributor starts processing a message and the respond-by time has passed, it moves on to the next request: doing this effectively means that when the mailbox starts to saturate, the message processing rate increases.
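
    A minimal sketch of that check (the names and the 4900 ms budget are assumptions, and a real distributor would forward the work onwards rather than reply directly):

        import akka.actor.typed.{ActorRef, Behavior}
        import akka.actor.typed.scaladsl.Behaviors

        // Hypothetical protocol: the asker stamps a respond-by deadline (in nanos) on each request.
        final case class TimedRequest(payload: String, respondByNanos: Long, replyTo: ActorRef[TimedResponse])
        final case class TimedResponse(result: String)

        object DeadlineAwareDistributor {
          def apply(): Behavior[TimedRequest] =
            Behaviors.receiveMessage { request =>
              if (System.nanoTime() > request.respondByNanos) {
                // The asker has almost certainly timed out already: don't spend time on it.
                Behaviors.same
              } else {
                request.replyTo ! TimedResponse(s"processed ${request.payload}")
                Behaviors.same
              }
            }
        }

        // At the HTTP layer, with a 5 second ask timeout, the ask might look like:
        //   distributor.ask[TimedResponse] { replyTo =>
        //     TimedRequest(payload, System.nanoTime() + 4900L * 1000 * 1000, replyTo)
        //   }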

    Of course, it's possible that your distributor's mailbox isn't the queue that's getting saturated, or that if it is, it's not because the actor is spending too much time processing messages. It's possible that the distributor (or other actors needed for a response) isn't getting a chance to process messages at all.

    Actors run inside a dispatcher which has the ability to have some number of actors (or Future callbacks or other tasks, each of which can be viewed as equivalent to an actor which is spawned for processing a single message) processing a message at a given time. If there are more actors which have a message in their respective mailboxes than the number that can be processing a message, those actors are in a queue to be scheduled (note that this applies even if you happen to have a dispatcher which will spawn as many threads as it needs to process a message: since there are a limited number of CPU cores, the OS kernel scheduler's queue will take the role of the dispatcher queue). Kamon can track the depth of this queue. In my experience, it's more valuable to detect whether thread starvation (basically, the time between a task being submitted and that task starting to execute exceeding some threshold) is occurring. Lightbend's package of commercial tooling for use with Akka (disclaimer: I am employed by Lightbend) provides tools for detecting, with minimal overhead, whether starvation is occurring and providing other diagnostic information.
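
    If you don't have access to such tooling, a crude, roll-your-own approximation (purely a sketch, and not how the commercial tooling works) is to measure the lag between submitting a task and that task actually starting to run, for example by wrapping an ExecutionContext that is backed by the dispatcher you care about:

        import scala.concurrent.ExecutionContext

        // Hypothetical starvation probe: delegates to an underlying ExecutionContext and
        // logs whenever the gap between submission and execution exceeds a threshold.
        final class StarvationProbe(underlying: ExecutionContext, thresholdNanos: Long)
            extends ExecutionContext {

          override def execute(runnable: Runnable): Unit = {
            val submittedAt = System.nanoTime()
            underlying.execute { () =>
              val lagNanos = System.nanoTime() - submittedAt
              if (lagNanos > thresholdNanos)
                println(s"dispatcher lag: ${lagNanos / 1000000} ms") // swap in your logger/metrics
              runnable.run()
            }
          }

          override def reportFailure(cause: Throwable): Unit =
            underlying.reportFailure(cause)
        }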

    If thread starvation is being observed, and things like garbage collection pauses, or CPU throttling (e.g. due to running in a container) are ruled out, the primary cause of starvation is actors (or actor-like things) taking too long to process a message either because they are executing blocking I/O or are doing too much in the processing of a single message. If blocking I/O is the culprit, try to move the I/O to actors or futures running in a thread pool with far more threads than the number of CPU cores (some even advocate for an unbounded thread pool for this purpose). If it's a case of doing too much computation in processing a single message, look for spots in the processing where it makes sense to capture the state needed for the remainder of the computation in a message and send that message to yourself (this is basically equivalent to a coroutine yielding).
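
    For the "send a message to yourself" approach, a minimal sketch (the summing task and all names are made up purely for illustration):

        import akka.actor.typed.Behavior
        import akka.actor.typed.scaladsl.Behaviors

        object ChunkedSummer {
          final case class Sum(remaining: List[Int], acc: Long)

          private val ChunkSize = 1000

          def apply(): Behavior[Sum] =
            Behaviors.receive { (context, message) =>
              val (chunk, rest) = message.remaining.splitAt(ChunkSize)
              val newAcc = message.acc + chunk.map(_.toLong).sum
              if (rest.isEmpty) {
                context.log.info(s"total = $newAcc")
              } else {
                // Capture the remaining work in a message to self; the dispatcher can then
                // schedule other actors before this one processes the next chunk.
                context.self ! Sum(rest, newAcc)
              }
              Behaviors.same
            }
        }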