scylla

Can't connect to Scylla because CONTROL_CONNECTION_FAILED


I consistently can't connect to a running Scylla cluster:

com.datastax.oss.driver.api.core.DriverTimeoutException:
  [s0|control|id: ..., L:/...:57579 - R:.../...:19042]
  Protocol initialization request, step 4
  (QUERY (SELECT cluster_name FROM system.local)):
  timed out after 5000 ms

If you think this is a consequence of an overloaded cluster, then I'd like to know which server-side Scylla metric(s) can visualise that, because I am not able to spot anything obviously maxed out.

Edit: Originally I wanted to ask about the root cause of this problem, but that's not a specific question.

Load per Scylla node seems to be ok:

[chart: Scylla cluster load per instance]

Logs contain messages like this:

reader_concurrency_semaphore - (rate limiting dropped 920 similar messages)
Semaphore _read_concurrency_sem with 100/100 count and 3366732/104354283
memory resources: timed out, dumping permit diagnostics:
permits        count    memory    table/description/state
99             99       3257K     A/data-query/inactive
1              1        29K       A/data-query/active/used
1              0        2K        A/multishard-mutation-query/active/unused
4              0        0B        B/data-query/waiting
1              0        0B        A/shard-reader/waiting
2              0        0B        system_auth.role_attributes/data-query/waiting
25             0        0B        C/data-query/waiting
7              0        0B        D/data-query/waiting
14             0        0B        system_auth.roles/data-query/waiting
7              0        0B        E/data-query/waiting
58             0        0B        F/data-query/waiting
6              0        0B        G/data-query/waiting
772            0        0B        A/data-query/waiting

997            100      3288K     total

Total: 997 permits with 100 count and 3288K memory resources

Solution

  • The control connection is a dedicated connection established in the first step of the connection process. It queries system tables to discover your cluster's topology and schema, and it reacts to server events, such as topology or schema changes.

    In your case, the driver gave up after the default 5s period, which likely indicates that your cluster is overloaded.

    You may increase the connect-timeout, init-query-timeout, and other relevant settings in your driver configuration, as listed on the driver's reference configuration page (look for advanced.connection); a minimal configuration sketch is included at the end of this answer.

    Remember that if you are using a ScyllaDB driver (you should), shard awareness means the driver establishes one connection per CPU (shard) on every node. If several clients try to establish connections in parallel, many TCP sockets may be opened concurrently, which can produce the effect you are seeing.

    Therefore, increase the aforementioned timeouts, ensure your clients do not introduce a connection storm against your database, or find the right balance between the two.

    From your diagnostic dump, it looks like your permits are maxed out, which indicates that disk is likely the bottleneck. You may refer to the Reader concurrency semaphore page to understand the meaning of each entry.

    Hope that helps!
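
    As a minimal sketch of the timeout changes mentioned above, assuming the Java driver 4.x programmatic configuration API (the 10-second values are purely illustrative, not a recommendation):

      import com.datastax.oss.driver.api.core.CqlSession;
      import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
      import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

      import java.time.Duration;

      public class SessionFactory {
          public static CqlSession connect() {
              // Raise the connection-related timeouts instead of relying on the
              // 5000 ms seen in the stack trace above.
              DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
                      // advanced.connection.connect-timeout: time allowed to
                      // establish each TCP connection to a node.
                      .withDuration(DefaultDriverOption.CONNECTION_CONNECT_TIMEOUT,
                              Duration.ofSeconds(10))
                      // advanced.connection.init-query-timeout: time allowed for the
                      // protocol initialization queries (the step that timed out here).
                      .withDuration(DefaultDriverOption.CONNECTION_INIT_QUERY_TIMEOUT,
                              Duration.ofSeconds(10))
                      .build();
              // Contact points and local datacenter omitted for brevity.
              return CqlSession.builder()
                      .withConfigLoader(loader)
                      .build();
          }
      }

    The same values can also be set declaratively in application.conf under datastax-java-driver.advanced.connection, as described on the reference configuration page.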