cassandra

Cassandra availability while index summary redistribution is being executed


My backend Flask server is connected to Cassandra, and I'm seeing Cassandra become unavailable for a short time every hour. I'm using cassandra-driver==3.25.0.

2024-12-24T16:07:00.363156408Z Traceback (most recent call last):
2024-12-24T16:07:00.363163598Z   File "cassandra/cluster.py", line 3522, in cassandra.cluster.ControlConnection._reconnect_internal
2024-12-24T16:07:00.363166032Z   File "cassandra/cluster.py", line 3544, in cassandra.cluster.ControlConnection._try_connect
2024-12-24T16:07:00.363168189Z   File "cassandra/cluster.py", line 1620, in cassandra.cluster.Cluster.connection_factory
2024-12-24T16:07:00.363170583Z   File "cassandra/connection.py", line 831, in cassandra.connection.Connection.factory
2024-12-24T16:07:00.363176028Z   File "/usr/local/lib/python3.7/site-packages/cassandra/io/libevreactor.py", line 267, in __init__
2024-12-24T16:07:00.363178247Z     self._connect_socket()
2024-12-24T16:07:00.363180375Z   File "cassandra/connection.py", line 898, in cassandra.connection.Connection._connect_socket
2024-12-24T16:07:00.363182758Z ConnectionRefusedError: [Errno 111] Tried connecting to [('172.18.0.5', 9042)]. Last error: Connection refused

I did a bit of research and found out that Cassandra internally performs index summary redistribution periodically (once every 60 minutes by default).

2024-12-25T16:10:46.731537895Z INFO  [IndexSummaryManager:1] 2024-12-26 01:10:46,730 IndexSummaryRedistribution.java:77 - Redistributing index summaries
2024-12-25T17:10:46.744763561Z INFO  [IndexSummaryManager:1] 2024-12-26 02:10:46,743 IndexSummaryRedistribution.java:77 - Redistributing index summaries
2024-12-25T18:10:46.752774412Z INFO  [IndexSummaryManager:1] 2024-12-26 03:10:46,751 IndexSummaryRedistribution.java:77 - Redistributing index summaries

I have only a single cluster deployed for Cassandra, so I'm wondering whether this is why the above behavior is happening: is there no node available to handle queries while the redistribution runs?

Hopefully someone can shed some light on this for me. Thank you.


Solution

  • Cassandra maintains an index of the partitions stored in SSTables (in the corresponding *-Index.db) as a strategy for quickly locating where the partition resides on disk.

    Since the partition index is stored on disk together with the data in the corresponding SSTable, Cassandra also maintains a summary of the partition index in memory (a sampling of the keys) as a further optimisation for faster index lookups.

    The index summary redistribution you refer to is one of the background processes that reallocates memory so that SSTables which are accessed more frequently ("hot" data) get preference over those which are "cold".

    To respond to your question directly, this is a periodic task that does not impact the regular operation of the cluster, which is why the message is logged at INFO level, not WARN or ERROR. It is not the reason you are seeing the errors you posted.

    A ConnectionRefusedError is network-related: it indicates that the client could not reach the server because the service is not responding on that particular port (the CQL client port is 9042 by default). If Cassandra is down or unresponsive for whatever reason, that would explain why the driver cannot connect to the node.

    You need to review the Cassandra logs for clues as to why the node is unreachable. For example, if the nodes are overloaded, it's possible that they don't respond before the client gives up. Cheers!
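
To line up the outages with the Cassandra logs, it can help to record exactly when the CQL port stops accepting connections. The following is only a rough sketch using the Python standard library, not part of cassandra-driver; the helper names `probe_port` and `watch` are made up for illustration. Roughly speaking, a "refused" result suggests the host is reachable but nothing is listening on the port (for example, the Cassandra process is down or restarting), while a "timeout" points more towards network trouble or an unresponsive node.

```python
import socket
import time
from datetime import datetime, timezone


def probe_port(host, port, timeout=2.0):
    """Try one TCP connection and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host answered, but nothing is accepting connections on this port.
        return "refused"
    except socket.timeout:
        # No answer within `timeout` seconds.
        return "timeout"
    except OSError as exc:
        # Any other socket-level failure (unreachable network, DNS, ...).
        return f"error: {exc}"


def watch(host, port, interval=5.0, attempts=3):
    """Probe repeatedly, printing a UTC timestamp with each result."""
    results = []
    for i in range(attempts):
        status = probe_port(host, port)
        stamp = datetime.now(timezone.utc).isoformat()
        print(f"{stamp} {host}:{port} {status}")
        results.append(status)
        if i < attempts - 1:
            time.sleep(interval)
    return results
```

For instance, running `watch("172.18.0.5", 9042, interval=5.0, attempts=720)` from the Flask container (the address is taken from the traceback in the question) would cover about an hour. Any "refused" timestamps can then be compared against the Cassandra system log for the same moments to see what the node was doing.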