Simple INSERT sporadically fails with Cassandra::Errors::TimeoutError, Cassandra::Errors::WriteTimeoutError

In production, with 3 nodes and LOCAL_QUORUM consistency, inserts sporadically fail. We only get Cassandra::Errors::TimeoutError, not Cassandra::Errors::WriteTimeoutError, which I think means the driver is not able to reach the node(s). However, I don't get Cassandra::Errors::NoHostsAvailable: All attempted hosts failed either.

I looked at the Cassandra logs and found nothing there; only the application logs show the error.

It happens about 1k times per day, and a retry from the caller side usually succeeds.
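For reference, the caller-side retry looks roughly like the sketch below. It is a minimal illustration, not our production code: `with_retries` and `ClientTimeoutError` are hypothetical names (the latter stands in for Cassandra::Errors::TimeoutError so the snippet runs without the gem), and the default limits simply mirror the retry settings listed further down.

```ruby
# Hypothetical stand-in for Cassandra::Errors::TimeoutError,
# so this sketch runs without the cassandra-driver gem.
class ClientTimeoutError < StandardError; end

# Runs the block, retrying with capped exponential backoff when it
# raises `retry_on`. Re-raises the error once max_attempts is reached.
def with_retries(max_attempts: 5, min_delay: 0.1, max_delay: 3.0,
                 retry_on: ClientTimeoutError)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue retry_on
    raise if attempt >= max_attempts
    # Delay doubles on each attempt: 0.1s, 0.2s, 0.4s, ... capped at max_delay.
    delay = [min_delay * (2**(attempt - 1)), max_delay].min
    sleep(delay)
    retry
  end
end
```

In the application the block would wrap the insert, e.g. `with_retries(retry_on: Cassandra::Errors::TimeoutError) { session.execute(statement) }`. Retrying is only safe here because a plain INSERT is an upsert; counter updates or list appends would need more care.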

My guess is that the driver is having some issue.

    ruby '~> 2.7'
    gem "cassandra-driver", "~> 3.2.5"

    consistency:           :local_quorum,

    load_balancing_policies = {
        dc_aware_round_robin: Cassandra::LoadBalancing::Policies::DCAwareRoundRobin.new(
            datacenter,
            cassandra_used_hosts_per_remote_dc
        ),
        round_robin: Cassandra::LoadBalancing::Policies::RoundRobin.new
    }

    CASSANDRA_CONNECT_TIMEOUT_MS: '600'
    CASSANDRA_CONSISTENCY: LOCAL_QUORUM
    CASSANDRA_RECONNECT_INITIAL_INTERVAL_MS: '100'
    CASSANDRA_RECONNECT_MAX_INTERVAL_MS: '3000'
    CASSANDRA_RECONNECT_MAX_RETRIES: '5'
    CASSANDRA_RETRIES: '5'
    CASSANDRA_RETRY_MAX_MS: '3000'
    CASSANDRA_RETRY_MIN_MS: '100'

So I looked at lib/cassandra/future.rb:

    # Returns future value or raises future error
    #
    # @note This method blocks until a future is resolved or a times out
    #
    # @param timeout [nil, Numeric] a maximum number of seconds to block
    #   current thread for while waiting for this future to resolve. Will
    #   wait indefinitely if passed `nil`.
    #
    # @raise [Errors::TimeoutError] raised when wait time exceeds the timeout
    # @raise [Exception] raises when the future has been resolved with an
    #   error. The original exception will be raised.
    #
    # @return [Object] the value that the future has been resolved with
    def get(timeout = nil)
      @signal.get(timeout)
    end
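As I read that doc comment, the TimeoutError is raised by the client when its own wait expires, regardless of what the server is doing. The sketch below illustrates that behaviour with a toy future. It is not the driver's implementation: `TinyFuture` and `ClientTimeoutError` are hypothetical names made up for this example.

```ruby
# Hypothetical stand-in for Cassandra::Errors::TimeoutError.
class ClientTimeoutError < StandardError; end

# Toy future: `get` blocks until `resolve` is called or the timeout expires.
class TinyFuture
  def initialize
    @lock = Mutex.new
    @cond = ConditionVariable.new
    @resolved = false
    @value = nil
  end

  # Called by whichever thread eventually receives the response.
  def resolve(value)
    @lock.synchronize do
      @value = value
      @resolved = true
      @cond.broadcast
    end
  end

  # Blocks the calling thread; raises if no value arrives within `timeout`
  # seconds. Waits indefinitely when timeout is nil.
  def get(timeout = nil)
    now = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
    deadline = timeout && (now.call + timeout)
    @lock.synchronize do
      until @resolved
        remaining = deadline && (deadline - now.call)
        raise ClientTimeoutError, 'Timed out' if remaining && remaining <= 0
        @cond.wait(@lock, remaining)
      end
      @value
    end
  end
end
```

If nothing ever calls `resolve`, `get(0.01)` raises after roughly 10 ms even though no server reported a failure, which would be consistent with an error that only shows up in the application logs.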

Cassandra::Errors::TimeoutError
Timed out

Crashed in non-app: cassandra/future.rb in get

cassandra/future.rb in get at line 402

cassandra/session.rb in execute at line 127

/srv/_versions/events/events-202304261636-9ba0b992cd-master/vendor/bundle/ruby/2.7.0/gems/cassandra-driver-3.2.5/lib/cassandra/future.rb:637:in 'get',
/srv/_versions/events/events-202304261636-9ba0b992cd-master/vendor/bundle/ruby/2.7.0/gems/cassandra-driver-3.2.5/lib/cassandra/future.rb:402:in 'get',
/srv/_versions/events/events-202304261636-9ba0b992cd-master/vendor/bundle/ruby/2.7.0/gems/cassandra-driver-3.2.5/lib/cassandra/session.rb:127:in 'execute'

Solution

I figured out the issue and just realized I never answered the question. The cause was large partition size: the Cassandra logs were flooded with messages like

WARN  [CompactionExecutor:170358] BigTableWriter.java:258 - Writing large partition xxx/yyy:1716208:2023-09-25-16-10 (103.262MiB) to sstable /data/cassandra/data/xxx/yyy-a88t665njhgs833sbjjkdl/nb-4343435-big-Data.db

Whenever a flush or compaction involves one of these > 100 MB partitions, latency increases drastically.

Solution -

It was rather simple: our partition key was some other column + a bucket (yyyy-mm-dd-hh-mm extracted from our clustering column of type timeuuid), and we were chopping the last digit off the minute, so essentially anything within a 10-minute window went to a single partition. I changed it to 1 minute. That stopped the bleeding while we redesign the table.
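To make the bucketing change concrete, here is a rough sketch of the two schemes. The method names are hypothetical and the exact format string is an assumption based on the description above; it takes a plain Time (the timestamp that would come out of the timeuuid) so it runs without the driver.

```ruby
# Old scheme: format down to the minute, then drop the last digit,
# so all rows within the same 10-minute window share one bucket.
def ten_minute_bucket(time)
  time.getutc.strftime('%Y-%m-%d-%H-%M').chop
end

# New scheme: keep the full minute, so each minute gets its own bucket.
def one_minute_bucket(time)
  time.getutc.strftime('%Y-%m-%d-%H-%M')
end

a = Time.utc(2023, 9, 25, 16, 10, 5)
b = Time.utc(2023, 9, 25, 16, 19, 59)

puts ten_minute_bucket(a) == ten_minute_bucket(b)  # true: same partition
puts one_minute_bucket(a) == one_minute_bucket(b)  # false: separate partitions
```

With the 1-minute bucket each partition holds at most one minute of writes for a given key, so at a steady write rate it should end up around a tenth of the previous size.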