postgresql, amazon-web-services, playframework, slick, cloudcaptain

Play Framework 2.5 Requests timing out randomly


Symptom

After running just fine for some time, our backend will stop giving responses for most of its endpoints. It will just start behaving like a black hole for those. Once in this state, it will stay there if we don't take any action.

Update

We can reproduce this behaviour with a db dump we made while the backend was in the non-responding state.

Infrastructure Setup

We are running Play 2.5 in AWS on an EC2 instance behind a load balancer, with a PostgreSQL database on RDS. We are using slick-pg as our database connector.

What we know

Here are a few things we have figured out so far.

About the HTTP requests

Our logs and debugging show us that the requests are passing the filters. Also, we see that during authentication (we are using Silhouette for that) the application is able to perform database queries to retrieve the identity for the request. The controller's action will just never be called, though.

The backend is still responding to HEAD requests. Further logging showed us that controllers using injected services (we are using Google's Guice for that) seem to be the ones whose methods are no longer being called. Controllers without injected services seem to work fine.
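For reference, below is a minimal sketch, assuming Play 2.5's Scala Filter API, of the kind of logging filter we used to confirm that requests reach the filter chain even when the controller action never completes (RequestLoggingFilter is a hypothetical name; our actual filter does more):

import javax.inject.Inject

import scala.concurrent.{ ExecutionContext, Future }

import akka.stream.Materializer
import play.api.Logger
import play.api.mvc.{ Filter, RequestHeader, Result }

class RequestLoggingFilter @Inject() (implicit val mat: Materializer, ec: ExecutionContext) extends Filter {

  def apply(next: RequestHeader => Future[Result])(rh: RequestHeader): Future[Result] = {
    // Logged before the downstream action runs; in our case this shows up even for the hanging endpoints.
    Logger.info(s"Filter reached for ${rh.method} ${rh.uri}")
    next(rh).map { result =>
      // For the hanging endpoints nothing after next(rh) ever resolves, so a line like this never appears.
      Logger.info(s"Response ${result.header.status} for ${rh.method} ${rh.uri}")
      result
    }
  }
}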

About the EC2 instance

Unfortunately, we are not able to get much information from it. We are using Boxfuse, which gives us an immutable infrastructure that we cannot SSH into. We are about to change this to a Docker-based deployment and might be able to provide more information soon. Nevertheless, we have New Relic set up to monitor our servers, and we cannot find anything suspicious there; memory and CPU usage look fine.

Still, this setup gives us a new EC2 instance on every deployment anyway. Even so, the issue persists after a redeployment most of the time; only occasionally does a redeployment resolve it.

Even weirder is the fact that we can run the backend locally, connected to the database on AWS, and everything works just fine there.

So it is hard for us to say where the problem lies. It seems the db stops working with any EC2 instance (until it eventually works with a new one), yet it keeps working with our local machines.

About the Database

The db is the only stateful entity in this setup, so we think the issue should somehow be related to it.

As we have a production and a staging environment, we are able to dump the production db into staging when the latter stops working, and we found that this indeed resolves the issue immediately. For a while we were not able to take a snapshot of a seemingly corrupt database and dump it into the staging environment to see whether that would break it. Now, however, we have a snapshot of the db from a time when the backend was no longer responding, and when we dump this into our staging environment, the backend stops responding immediately.

The number of connections to the DB is around 20 according to the AWS console, which is normal.

TL;DR

Our Play 2.5 backend on AWS stops responding for most of its endpoints after some time: requests pass the filters and authentication, but the controller actions are never called, and redeploying does not reliably fix it.

Update 1

Responding to marcospereira:

Take for instance this ApplicationController.scala:

package controllers

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

import akka.actor.ActorRef
import com.google.inject.Inject
import com.google.inject.name.Named
import com.mohiva.play.silhouette.api.Silhouette
import play.api.i18n.{ I18nSupport, MessagesApi }
import play.api.mvc.Action
import play.api.mvc.Controller

import jobs.jobproviders.BatchJobChecker.UpdateBasedOnResourceAvailability
import utils.auth.JobProviderEnv

/**
 * The basic application controller.
 *
 * @param messagesApi The Play messages API.
 * @param webJarAssets The webjar assets implementation.
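 * @param silhouette The Silhouette stack used for authentication.
 * @param batchJobChecker The batch job checker actor.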
 */
class ApplicationController @Inject() (
  val messagesApi: MessagesApi,
  silhouette: Silhouette[JobProviderEnv],
  implicit val webJarAssets: WebJarAssets,
  @Named("batch-job-checker") batchJobChecker: ActorRef
) extends Controller with I18nSupport {

  def index = Action.async { implicit request =>
    Future.successful(Ok)
  }

  def admin = Action.async { implicit request =>
    Future.successful(Ok(views.html.admin.index.render))
  }


  def taskChecker = silhouette.SecuredAction.async {
    batchJobChecker ! UpdateBasedOnResourceAvailability
    Future.successful(Ok)
  }

}

The index and admin actions are working just fine. The taskChecker action shows the weird behaviour, though.

Update 2

We are able to reproduce this issue now! We found that we had made a db dump the last time our backend stopped responding. When we dump this into our staging database, the backend stops responding immediately.

We have now started logging the number of threads in one of our filters, using Thread.getAllStackTraces.keySet.size, and found that there are between 50 and 60 threads running.
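This is roughly what the measurement looks like (ThreadStats is a hypothetical helper name; in our code the call simply lives inside one of our filters):

import play.api.Logger

object ThreadStats {

  /** Total number of live JVM threads, as seen via their stack traces. */
  def liveThreadCount: Int = Thread.getAllStackTraces.keySet.size

  def log(): Unit = Logger.info(s"Currently running threads: $liveThreadCount")
}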

Update 3

As @AxelFontaine suggested, we enabled Multi-AZ deployment for the database and rebooted it with failover. Before, during, and after the reboot the backend was not responding.

After the reboot we noticed that the number of connections to the db stayed at 0. Also, we no longer got any logs from the authentication step (before, we did; the authentication step could even make db requests and got responses).

After rebooting the EC2 instance, we are now getting

play.api.UnexpectedException: Unexpected exception[RejectedExecutionException: Task slick.backend.DatabaseComponent$DatabaseDef$$anon$2@76d6ac53 rejected from java.util.concurrent.ThreadPoolExecutor@6ea1d0ce[Running, pool size = 4, active threads = 4, queued tasks = 5, completed tasks = 157]]

(we did not get those before)

for our requests as well as for our background jobs that need to access the db. Our Slick settings now include

numThreads = 4
queueSize = 5
maxConnections = 10
connectionTimeout = 5000
validationTimeout = 5000

as suggested here
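To see why these settings lead to the RejectedExecutionException above, here is a small, self-contained sketch. It is not Slick's actual executor, just a plain ThreadPoolExecutor configured with the same shape (4 fixed threads, a bounded queue of 5); submitting a 10th long-running task is rejected immediately, which matches the "pool size = 4, active threads = 4, queued tasks = 5" in the exception message:

import java.util.concurrent.{ ArrayBlockingQueue, RejectedExecutionException, ThreadPoolExecutor, TimeUnit }

object RejectionDemo extends App {

  // 4 core/max threads, keep-alive is irrelevant here, work queue bounded to 5 entries.
  val pool = new ThreadPoolExecutor(4, 4, 1L, TimeUnit.MINUTES, new ArrayBlockingQueue[Runnable](5))

  // Stands in for a slow db call that blocks its worker thread.
  val longTask = new Runnable { def run(): Unit = Thread.sleep(10000) }

  try {
    // Tasks 1-4 start running, 5-9 are queued, task 10 is rejected by the default AbortPolicy.
    (1 to 10).foreach { i =>
      pool.execute(longTask)
      println(s"task $i accepted")
    }
  } catch {
    case e: RejectedExecutionException => println(s"task rejected: $e")
  } finally {
    pool.shutdownNow()
  }
}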

Update 4

After we got the exceptions described in Update 3, the backend is now running fine again. We did not do anything to cause this; it is the first time the backend has recovered from this state without our intervention.


Solution

  • After some more investigation we found that one of our jobs was generating deadlocks in our database. The issue we were running into is a known bug in the Slick version we were using, reported on GitHub.

    So the problem was that we were running DB transactions with .transactionally inside a .map of a DBIOAction on too many threads at the same time; a minimal sketch of the pattern is shown below.
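Below is a minimal sketch of the difference, assuming plain Slick 3.1 with the vanilla PostgresDriver (our real code uses the slick-pg extension) and made-up table and method names:

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

import slick.driver.PostgresDriver.api._

class JobRepository(db: Database) {

  // Hypothetical actions standing in for our real queries.
  def findJobIds: DBIO[Seq[Long]] = sql"select id from jobs".as[Long]
  def markChecked(id: Long): DBIO[Int] = sqlu"update jobs set checked = true where id = $id"

  // Problematic: a new transaction is started from inside the .map of another DBIOAction.
  // The inner db.run needs its own thread and connection while the outer action still holds
  // one, so with a small thread pool these nested transactions can end up deadlocking.
  def badRun(): Future[Unit] =
    db.run(findJobIds.map { ids =>
      ids.foreach(id => db.run(markChecked(id).transactionally))
    })

  // Better: compose everything into a single DBIOAction with flatMap/sequence and apply
  // .transactionally once, so no nested db.run is needed.
  def goodRun(): Future[Seq[Int]] =
    db.run(
      findJobIds
        .flatMap(ids => DBIO.sequence(ids.map(markChecked)))
        .transactionally
    )
}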