cassandra cassandra-2.2

Cassandra 3 nodes cluster throwing NoHostAvailableException as soon as one node is down


We have a 3 nodes cluster with a RF 3.

As soon as we drain one node from the cluster we see many:

All host(s) tried for query failed (no host was tried)
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
        at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
        at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
        at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:214)
        at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)

All our reads and writes use a consistency level of QUORUM or ONE, so with one node down everything should keep working. But as long as the node is down, these exceptions are thrown.
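The expectation above follows from the quorum arithmetic: with RF = 3, QUORUM needs floor(RF / 2) + 1 = 2 replicas, so losing one of three nodes should still satisfy it. A minimal sketch of that calculation (class and method names are mine, not from the driver):

```java
// Quorum math for Cassandra consistency levels: QUORUM requires
// floor(RF / 2) + 1 replicas to acknowledge a read or write.
public class QuorumMath {

    // Number of replicas QUORUM needs for a given replication factor.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        // With RF = 3, QUORUM needs 2 replicas: one node may be down.
        System.out.println("RF=3 QUORUM needs " + quorum(3) + " replicas");
    }
}
```

So the NoHostAvailableException here cannot be explained by the consistency level alone; something else is keeping the driver from seeing the surviving replicas.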

We use Cassandra 2.2.4 + Java Cassandra Driver 2.1.10.2

Here's how we create our cluster:

new Cluster.Builder()
    .addContactPoints(CONTACT_POINTS)
    .withCredentials(USERNAME, PASSWORD)
    .withRetryPolicy(new LoggingRetryPolicy(DefaultRetryPolicy.INSTANCE))
    .withReconnectionPolicy(new ExponentialReconnectionPolicy(10, 10000))
    .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
    .withSocketOptions(new SocketOptions().setReadTimeoutMillis(12_000))
    .build();

CONTACT_POINTS is a String array of the 3 public ips of the nodes.

A few months ago, the cluster temporarily ran fine with only 2 nodes, but for an unknown reason that is no longer the case, and I'm running out of ideas :(

Thanks a lot for your help!


Solution

  • Problem solved.

    Further analysis showed that the issue came from an IP mismatch. Our Cassandra servers use private local IPs (10.0.x.x) to communicate with each other, while our app servers have the nodes' public IPs in their config.

    When both were in the same network this worked, but once the app servers moved to a different network, the driver could connect to only one machine of the cluster. The other two were considered down, because the driver was trying to reach them on their private local IPs instead of their public ones.

    The solution was to add an AddressTranslater to the cluster builder:

    .withAddressTranslater(new ToPublicIpAddressTranslater())
    

    With the following code:

    import java.net.InetSocketAddress;
    import java.util.HashMap;
    import java.util.Map;

    import com.datastax.driver.core.policies.AddressTranslater;

    private static class ToPublicIpAddressTranslater implements AddressTranslater {

        // Maps each node's private (broadcast) IP to the public IP
        // our app servers can actually reach.
        private final Map<String, String> internalToPublicIpMap = new HashMap<>();

        public ToPublicIpAddressTranslater() {
            for (int i = 0; i < CONTACT_POINT_PRIVATE_IPS.length; i++) {
                internalToPublicIpMap.put(CONTACT_POINT_PRIVATE_IPS[i], CONTACT_POINTS[i]);
            }
        }

        @Override
        public InetSocketAddress translate(InetSocketAddress address) {
            String publicIp = internalToPublicIpMap.get(address.getHostString());
            if (publicIp != null) {
                return new InetSocketAddress(publicIp, address.getPort());
            }
            // Unknown addresses (e.g. already-public ones) pass through unchanged.
            return address;
        }
    }
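    The pass-through behavior is worth sanity-checking: addresses without a mapping must come back untouched, or the translator would break the node the driver can already reach. A stdlib-only sketch of the same lookup logic, using hypothetical example IPs of my own (the real private/public pairs come from your config):

    ```java
    import java.net.InetSocketAddress;
    import java.util.HashMap;
    import java.util.Map;

    // Stdlib-only version of the translation logic, checkable without
    // the DataStax driver on the classpath. The IPs are made up.
    public class TranslaterCheck {

        static final Map<String, String> PRIVATE_TO_PUBLIC = new HashMap<>();
        static {
            PRIVATE_TO_PUBLIC.put("10.0.0.1", "203.0.113.1"); // hypothetical pair
        }

        static InetSocketAddress translate(InetSocketAddress address) {
            String publicIp = PRIVATE_TO_PUBLIC.get(address.getHostString());
            return publicIp != null
                    ? new InetSocketAddress(publicIp, address.getPort())
                    : address; // unknown addresses pass through unchanged
        }

        public static void main(String[] args) {
            // Mapped private IP is rewritten to its public counterpart.
            System.out.println(translate(InetSocketAddress.createUnresolved("10.0.0.1", 9042)));
            // Unmapped IP comes back as-is.
            System.out.println(translate(InetSocketAddress.createUnresolved("10.0.0.99", 9042)));
        }
    }
    ```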