mysql jdbc amazon-rds tomcat-jdbc

Tomcat 8.5 Connection Pool not reconnecting after DB failover


I have an application using the Tomcat 8.5 connection pool, Java 8, and a Multi-AZ AWS RDS MySQL database. Over the last few years, we have had a couple of database issues that led to a failover. Each time, the pool detected that the connection was closed (No operations allowed after connection closed) and reconnected correctly about a minute later, once the backup node was up.

A few days ago we had a failover that didn't follow this pattern. Because of a database hardware issue, the database became unavailable and a failover took place. When the backup node came up a couple of minutes later, we could connect to the database correctly from our desktop MySQL client.

Even several minutes after the failover took place and connectivity to the database was restored, the application was still logging hundreds of exceptions like:

com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed
...
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
...
The last packet successfully received from the server was 20,017 milliseconds ago.  The last packet sent successfully to the server was 20,016 milliseconds ago
...
Caused by: java.net.SocketTimeoutException: Read timed out
...

The application couldn't reconnect until we restarted the Tomcat servers.

Our pool is configured this way:

initialSize = 5
maxActive = 16
minIdle = 5
maxIdle = 8
maxWait = 10000
maxAge = 600000
timeBetweenEvictionRunsMillis = 5000
minEvictableIdleTimeMillis = 60000
validationQuery = "SELECT 1"
validationQueryTimeout = 3
validationInterval = 15000
testOnBorrow = true
testWhileIdle = true
testOnReturn = false
jdbcInterceptors = "ConnectionState;StatementCache(max=200)"
defaultTransactionIsolation = java.sql.Connection.TRANSACTION_READ_COMMITTED

And the JDBC connection URL has these parameters:

autoreconnect=true&socketTimeout=20000

As I understand it, the validationQuery should have failed and the connection should have been discarded, so a new, working connection should have been created. Also, according to maxAge, all connections should have been discarded and replaced with new ones after 10 minutes.
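To make that expectation concrete, here is a minimal sketch of the two checks as the Tomcat JDBC pool documentation describes them: a testOnBorrow validation is skipped when the connection was already validated within validationInterval, and maxAge is compared against the time since the connection was established. The class and method names are hypothetical, and this is a simplified model of the documented behavior, not the pool's actual code.

```java
// Simplified model of two documented Tomcat JDBC pool checks.
// Class and method names are hypothetical; this is not Tomcat's code.
final class PoolCheckSketch {

    private PoolCheckSketch() {
    }

    // True if a borrow at nowMillis would actually run the validation
    // query. A connection validated less than validationIntervalMillis
    // ago is handed out without being validated again.
    static boolean validatesOnBorrow(long nowMillis,
                                     long lastValidatedMillis,
                                     long validationIntervalMillis) {
        return nowMillis - lastValidatedMillis >= validationIntervalMillis;
    }

    // True if the connection has been open longer than maxAgeMillis.
    // A maxAge of 0 or less is treated here as "check disabled".
    static boolean expiredByMaxAge(long nowMillis,
                                   long connectedAtMillis,
                                   long maxAgeMillis) {
        return maxAgeMillis > 0 && nowMillis - connectedAtMillis > maxAgeMillis;
    }
}
```

With the settings above (validationInterval = 15000, maxAge = 600000), a connection validated 10 seconds earlier would be handed out without running SELECT 1, and a connection opened more than 10 minutes earlier would count as expired. The documentation describes the maxAge check as happening when a connection is returned to the pool.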

The pool didn't recover even after 20 minutes. As mentioned, we had to restart the Tomcat servers.

Is there any explanation for why the pool had always recovered correctly from a failover before, but couldn't in this case?


Solution

  • I ended up adding an AWS RDS Proxy, which resolved this issue.

    I provoked DB failovers for an hour and everything worked fine, with outages of less than 20 seconds. This required no changes to my application code, only pointing it at the new proxy endpoint.
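As an illustration of what "pointing to the new proxy endpoint" amounts to, the sketch below swaps only the host in a JDBC URL and keeps the port, database, and parameters unchanged. The class name, method name, and hostnames are hypothetical; the actual change was simply made in the configured URL.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: replaces only the host part of a JDBC URL.
final class JdbcEndpointSwap {

    private static final String PREFIX = "jdbc:";

    private JdbcEndpointSwap() {
    }

    // Returns jdbcUrl with its host replaced by newHost. Port, path and
    // query parameters are kept; any user-info in the URL is dropped.
    static String withHost(String jdbcUrl, String newHost) {
        if (!jdbcUrl.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not a JDBC URL: " + jdbcUrl);
        }
        // Parse everything after "jdbc:" as an ordinary hierarchical URI.
        URI old = URI.create(jdbcUrl.substring(PREFIX.length()));
        try {
            URI swapped = new URI(old.getScheme(), null, newHost,
                    old.getPort(), old.getPath(), old.getQuery(), null);
            return PREFIX + swapped;
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid host: " + newHost, e);
        }
    }
}
```

For example, calling withHost on a URL whose host is the database endpoint, passing the proxy endpoint as newHost, yields the same URL with the same socketTimeout and other parameters, now targeting the proxy.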