In production we run 3 nodes with LOCAL_QUORUM consistency. Inserts fail sporadically, and the error we get is Cassandra::Errors::TimeoutError
and not Cassandra::Errors::WriteTimeoutError,
which I think means the driver cannot reach the node(s), yet I don't get Cassandra::Errors::NoHostsAvailable: All attempted hosts failed either.
There is nothing in the Cassandra logs; only the application logs show the error.
It happens about 1k times per day, and a retry from the caller side usually succeeds.
My guess is that the driver is having some issue.
ruby '~> 2.7'
gem "cassandra-driver", "~> 3.2.5"
consistency: :local_quorum,
load_balancing_policies = {
  dc_aware_round_robin: Cassandra::LoadBalancing::Policies::DCAwareRoundRobin.new(
    datacenter,
    cassandra_used_hosts_per_remote_dc
  ),
  round_robin: Cassandra::LoadBalancing::Policies::RoundRobin.new
}
CASSANDRA_CONNECT_TIMEOUT_MS: '600'
CASSANDRA_CONSISTENCY: LOCAL_QUORUM
CASSANDRA_RECONNECT_INITIAL_INTERVAL_MS: '100'
CASSANDRA_RECONNECT_MAX_INTERVAL_MS: '3000'
CASSANDRA_RECONNECT_MAX_RETRIES: '5'
CASSANDRA_RETRIES: '5'
CASSANDRA_RETRY_MAX_MS: '3000'
CASSANDRA_RETRY_MIN_MS: '100'
So I looked at lib/cassandra/future.rb:
# Returns future value or raises future error
#
# @note This method blocks until a future is resolved or a times out
#
# @param timeout [nil, Numeric] a maximum number of seconds to block
# current thread for while waiting for this future to resolve. Will
# wait indefinitely if passed `nil`.
#
# @raise [Errors::TimeoutError] raised when wait time exceeds the timeout
# @raise [Exception] raises when the future has been resolved with an
# error. The original exception will be raised.
#
# @return [Object] the value that the future has been resolved with
def get(timeout = nil)
  @signal.get(timeout)
end
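To illustrate what the doc comment above describes, here is a minimal stdlib-only sketch of a blocking `get` with a timeout. `SketchFuture` and `SketchTimeoutError` are hypothetical names for illustration, not the driver's classes, and the real driver's internals may differ. The point is that this kind of timeout is raised by the waiting client thread when its own deadline passes, regardless of what the server is doing.

```ruby
# Hypothetical stand-ins, NOT part of cassandra-driver.
class SketchTimeoutError < StandardError; end

class SketchFuture
  def initialize
    @lock = Mutex.new
    @cond = ConditionVariable.new
    @resolved = false
    @value = nil
  end

  # Called by whoever produces the result (e.g. an I/O thread).
  def resolve(value)
    @lock.synchronize do
      @value = value
      @resolved = true
      @cond.broadcast
    end
  end

  # Blocks until resolved, or raises once `timeout` seconds have elapsed.
  # Waits indefinitely when timeout is nil.
  def get(timeout = nil)
    now = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
    deadline = timeout && now.call + timeout
    @lock.synchronize do
      until @resolved
        remaining = deadline && deadline - now.call
        raise SketchTimeoutError, "Timed out" if remaining && remaining <= 0
        @cond.wait(@lock, remaining)
      end
      @value
    end
  end
end

# A future that nobody resolves in time: the caller gives up on its own.
begin
  SketchFuture.new.get(0.05)
rescue SketchTimeoutError => e
  puts "client-side wait expired: #{e.message}"
end
```

In this sketch a slow response and an unreachable node look the same to the caller, which would be consistent with seeing a plain timeout rather than a server-reported write timeout.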
Cassandra::Errors::TimeoutError
Timed out
Crashed in non-app: cassandra/future.rb in get
cassandra/future.rb in get at line 402
cassandra/session.rb in execute at line 127
/srv/_versions/events/events-202304261636-9ba0b992cd-master/vendor/bundle/ruby/2.7.0/gems/cassandra-driver-3.2.5/lib/cassandra/future.rb:637:in 'get',
/srv/_versions/events/events-202304261636-9ba0b992cd-master/vendor/bundle/ruby/2.7.0/gems/cassandra-driver-3.2.5/lib/cassandra/future.rb:402:in 'get',
/srv/_versions/events/events-202304261636-9ba0b992cd-master/vendor/bundle/ruby/2.7.0/gems/cassandra-driver-3.2.5/lib/cassandra/session.rb:127:in 'execute'
So I figured out the issue and just realized I never answered the question. The cause was large partitions; the Cassandra logs were full of messages like
WARN [CompactionExecutor:170358] BigTableWriter.java:258 - Writing large partition xxx/yyy:1716208:2023-09-25-16-10 (103.262MiB) to sstable /data/cassandra/data/xxx/yyy-a88t665njhgs833sbjjkdl/nb-4343435-big-Data.db
Whenever a memtable flush or compaction wrote one of these > 100 MB partitions, latency increased drastically.
Solution:
It was rather simple. Our partition key was some other column + a bucket (yyyy-mm-dd-hh-mm extracted from our clustering column of type timeuuid), and we were chopping the last digit off the minute, so essentially everything within a 10-minute window went to a single partition. I changed the bucket to 1 minute. That stopped the bleeding while we redesign the table.
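The bucketing change can be sketched roughly as follows. `partition_bucket` is a hypothetical helper written for this post, not our actual code; it assumes the timestamp has already been extracted from the timeuuid and that the bucket is formatted in UTC with the minute floored to the window size.

```ruby
# Hypothetical helper: floors a timestamp to a window of `window_minutes`
# and formats it as yyyy-mm-dd-hh-mm (UTC assumed).
def partition_bucket(time, window_minutes)
  utc = time.getutc # getutc returns a new Time and does not mutate the argument
  floored_minute = (utc.min / window_minutes) * window_minutes
  format("%s-%02d", utc.strftime("%Y-%m-%d-%H"), floored_minute)
end

t = Time.utc(2023, 9, 25, 16, 17, 42)

# Old behaviour: every write in a 10-minute window shares one partition.
puts partition_bucket(t, 10) # 2023-09-25-16-10

# New behaviour: one partition per minute, so roughly a tenth of the data each.
puts partition_bucket(t, 1)  # 2023-09-25-16-17
```

A narrower window spreads the same write volume over more, smaller partitions; the trade-off is that a read covering a time range has to touch more partitions.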