A small question regarding a Spring WebFlux reactive Cassandra application, please.
On a Spring Boot 2.6.4 setup with WebFlux and reactive Cassandra, I am using the app to insert some data into Cassandra tables.
Things work fine until the load gets higher, at which point I am seeing an issue (stack trace attached below).
Reading some documentation, I thought this piece of code would help solve the issue:
@Bean
public DriverConfigLoaderBuilderCustomizer defaultProfile() {
    return builder -> builder
            .withInt(DefaultDriverOption.METADATA_SCHEMA_REQUEST_TIMEOUT, 5000)
            .build();
}
(Here is my Cassandra config; note I am using CqlSession.builder().)
protected CqlSession cqlLocalSession() {
    return CqlSession.builder()
            .addContactEndPoint(new DefaultEndPoint(new InetSocketAddress(contactPoints, port)))
            .withKeyspace(keyspace)
            .withLocalDatacenter(datacenter)
            .withAuthCredentials(username, passPhrase)
            .build();
}
Unfortunately, the DriverConfigLoaderBuilderCustomizer does not seem to work: I am still getting Query timed out after PT2S (I would at least have expected to see PT5S).
May I ask what the issue is, and how to properly increase the timeout, please?
Thank you
org.springframework.data.cassandra.CassandraUncategorizedException: ReactivePreparedStatementCallback; CQL [INSERT INTO mutable (x,y,z) VALUES (?,?,?)]; Query timed out after PT2S; nested exception is com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT2S
at org.springframework.data.cassandra.core.cql.CassandraExceptionTranslator.translate(CassandraExceptionTranslator.java:162) ~[spring-data-cassandra-3.3.2.jar!/:3.3.2]
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
*__checkpoint ⇢ Handler myHandler [DispatcherHandler]
Original Stack Trace:
at org.springframework.data.cassandra.core.cql.CassandraExceptionTranslator.translate(CassandraExceptionTranslator.java:162) ~[spring-data-cassandra-3.3.2.jar!/:3.3.2]
at org.springframework.data.cassandra.core.cql.ReactiveCassandraAccessor.translate(ReactiveCassandraAccessor.java:149) ~[spring-data-cassandra-3.3.2.jar!/:3.3.2]
at org.springframework.data.cassandra.core.cql.ReactiveCqlTemplate.lambda$translateException$15(ReactiveCqlTemplate.java:820) ~[spring-data-cassandra-3.3.2.jar!/:3.3.2]
at reactor.core.publisher.Flux.lambda$onErrorMap$28(Flux.java:6911) ~[reactor-core-3.4.15.jar!/:3.4.15]
at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:94) ~[reactor-core-3.4.15.jar!/:3.4.15]
at org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onError(ScopePassingSpanSubscriber.java:96) ~[spring-cloud-sleuth-instrumentation-3.1.1.jar!/:3.1.1]
at reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain.onError(MonoFlatMapMany.java:204) ~[reactor-core-3.4.15.jar!/:3.4.15]
at reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onError(FluxContextWrite.java:121) ~[reactor-core-3.4.15.jar!/:3.4.15]
at org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onError(ScopePassingSpanSubscriber.java:96) ~[spring-cloud-sleuth-instrumentation-3.1.1.jar!/:3.1.1]
at reactor.core.publisher.MonoCompletionStage.lambda$subscribe$0(MonoCompletionStage.java:80) ~[reactor-core-3.4.15.jar!/:3.4.15]
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
at com.datastax.oss.driver.internal.core.cql.CqlPrepareAsyncProcessor.lambda$process$1(CqlPrepareAsyncProcessor.java:71) ~[java-driver-core-4.13.0.jar!/:na]
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.setFinalError(CqlPrepareHandler.java:320) ~[java-driver-core-4.13.0.jar!/:na]
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.lambda$scheduleTimeout$1(CqlPrepareHandler.java:163) ~[java-driver-core-4.13.0.jar!/:na]
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715) ~[netty-common-4.1.74.Final.jar!/:4.1.74.Final]
at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34) ~[netty-common-4.1.74.Final.jar!/:4.1.74.Final]
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703) ~[netty-common-4.1.74.Final.jar!/:4.1.74.Final]
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790) ~[netty-common-4.1.74.Final.jar!/:4.1.74.Final]
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503) ~[netty-common-4.1.74.Final.jar!/:4.1.74.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.74.Final.jar!/:4.1.74.Final]
at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]
Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT2S
at com.datastax.oss.driver.internal.core.cql.CqlPrepareHandler.lambda$scheduleTimeout$1(CqlPrepareHandler.java:163) ~[java-driver-core-4.13.0.jar!/:na]
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715) ~[netty-common-4.1.74.Final.jar!/:4.1.74.Final]
at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34) ~[netty-common-4.1.74.Final.jar!/:4.1.74.Final]
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703) ~[netty-common-4.1.74.Final.jar!/:4.1.74.Final]
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790) ~[netty-common-4.1.74.Final.jar!/:4.1.74.Final]
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503) ~[netty-common-4.1.74.Final.jar!/:4.1.74.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.74.Final.jar!/:4.1.74.Final]
at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]
You have configured the wrong option on the driver. METADATA_SCHEMA_REQUEST_TIMEOUT
is the timeout for the requests to retrieve the schema.
The request timeout for the Java driver is controlled by basic.request.timeout, which defaults to 2 seconds (see the driver's reference configuration):
datastax-java-driver {
  basic.request {
    timeout = 2 seconds
  }
}
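If the default file-based configuration loader is in effect, you can override this in an application.conf on your classpath, for example (5 seconds is purely an illustrative value, not a recommendation):
datastax-java-driver {
  basic.request {
    timeout = 5 seconds
  }
}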
The equivalent typed driver option for basic.request.timeout is REQUEST_TIMEOUT (see TypedDriverOption.java).
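As a minimal sketch of how your customizer could set the request timeout programmatically instead (the bean name and the 5-second value are just examples, not recommendations):
import java.time.Duration;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import org.springframework.boot.autoconfigure.cassandra.DriverConfigLoaderBuilderCustomizer;

@Bean
public DriverConfigLoaderBuilderCustomizer requestTimeout() {
    // Raise basic.request.timeout (DefaultDriverOption.REQUEST_TIMEOUT) to 5 seconds.
    // Note there is no .build() here; Spring Boot builds the config loader itself.
    return builder -> builder.withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofSeconds(5));
}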
I should point out that increasing the request timeout is almost never the correct way of handling the issue you're facing, because you're just hiding the problem without actually addressing the root cause.
As you already pointed out at the start of your post, the problem arises when the load is higher. The reason you're getting DriverTimeoutException is that the coordinators do not respond in time, and it's not difficult to conclude that this is because the nodes are overloaded.
Writes in Cassandra are very fast because mutations are appended to the end of the commitlog -- there are no disk seeks. If the disk for the commitlog is slow or cannot keep up with the writes, you need to check that the commitlog is on a separate disk/volume so it's not competing for the same IO as reads,