cassandra datastax-enterprise datastax-startup

Cassandra host in cluster with null ID


Note: We are seeing this issue in our Cassandra 2.1.12.1047 (DSE 4.8.4) cluster with 6 nodes across 3 regions (2 in each region).

Trying to update schemas on our cluster recently, we found the updates were failing. We suspected one node in the cluster was not accepting the change.

When we checked the system.peers table on one of our servers in us-east-1, we found an anomaly: it had what seemed to be a complete entry for a host that does not exist.

cassandra@cqlsh> SELECT peer, host_id FROM system.peers WHERE peer IN ('54.158.22.187', '54.196.90.253');

peer          | host_id
---------------+--------------------------------------
54.158.22.187 | 8ebb7f2c-8f81-44af-814b-a537b84834e0

As that host did not exist, I tried to remove it using nodetool removenode, but that failed with the following error:

error: Cannot remove self
-- StackTrace --
java.lang.UnsupportedOperationException: Cannot remove self

We know that the .187 server was abruptly terminated a few weeks ago due to an EC2 issue.

We made numerous attempts to make the server healthy, but in the end we simply terminated the server that was reporting this .187 host in its system.peers table, ran nodetool removenode from one of the other servers, and then brought a new server online.

The new server came online, and within an hour or so seemed to have caught up on the backlog of activity needed to bring it in line with the other servers (an assumption based purely on CPU monitoring).

However, things are now very odd: the .187 host that was reported in the system.peers table now appears when we run nodetool status from any server in the cluster other than the new one we just brought online.

$ nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns    Host ID                               Rack
DN  54.158.22.187   ?          256     ?       null                                  r1
Datacenter: cassandra-ap-southeast-1-A
======================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns    Host ID                               Rack
UN  54.255.xx.xx    7.9 GB     256     ?       a0c45f3f-8479-4046-b3c0-b2dd19f07b87  ap-southeast-1a
UN  54.255.xx.xx    8.2 GB     256     ?       b91c5863-e1e1-4cb6-b9c1-0f24a33b4baf  ap-southeast-1b
Datacenter: cassandra-eu-west-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns    Host ID                               Rack
UN  176.34.xx.xxx   8.51 GB    256     ?       30ff8d00-1ab6-4538-9c67-a49e9ad34672  eu-west-1b
UN  54.195.xx.xxx   8.4 GB     256     ?       f00dfb85-6099-40fa-9eaa-cf1dce2f0cd7  eu-west-1c
Datacenter: cassandra-us-east-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns    Host ID                               Rack
UN  54.225.xx.xxx   8.17 GB    256     ?       0e0adf3d-4666-4aa4-ada7-4716e7c49ace  us-east-1e
UN  54.224.xx.xxx   3.66 GB    256     ?       1f9c6bef-e479-49e8-a1ea-b1d0d68257c7  us-east-1d
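A node in this state can be picked out of the status output mechanically. The following is a minimal sketch, assuming the column layout shown above (Host ID in the second-to-last column, rack names without spaces); the function name `null_host_id_nodes` is hypothetical and not part of nodetool.

```shell
# Print the address of every node whose Host ID column reads "null".
# Reads `nodetool status` output on stdin.
null_host_id_nodes() {
  # Node rows start with a two-letter status/state code such as UN or DN.
  # Host ID is the second-to-last field, whether Load is "?" or "7.9 GB".
  awk '$1 ~ /^[UD][NLJM]$/ && $(NF-1) == "null" { print $2 }'
}

# Usage:
#   nodetool status | null_host_id_nodes
```

Run against the output above, this would print only the phantom 54.158.22.187 entry.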

As there is no way I know of to delete a node that does not have a Host ID, I am quite perplexed.

What can I do to get rid of this rogue node?

Note: Here is the result from a describecluster

$ nodetool describecluster
Cluster Information:
  Name: XXX
  Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
  Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
  Schema versions:
    d140bc9b-134c-3dbe-929f-7a84c2cd4532: [54.255.17.28, 176.34.207.151, 54.225.11.249, 54.195.174.72, 54.224.182.94, 54.255.64.1]

    UNREACHABLE: [54.158.22.187]
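To compare this across nodes, the unreachable endpoints can be extracted from the describecluster output. This is a rough sketch that assumes the `UNREACHABLE: [...]` line format shown above; the function name `unreachable_nodes` is made up for illustration.

```shell
# Print each endpoint listed on the UNREACHABLE line, one per line.
# Reads `nodetool describecluster` output on stdin.
unreachable_nodes() {
  # Keep only the text between the square brackets, then split on commas.
  sed -n 's/^[[:space:]]*UNREACHABLE:[[:space:]]*\[\(.*\)][[:space:]]*$/\1/p' \
    | tr -d ' ' \
    | tr ',' '\n'
}

# Usage:
#   nodetool describecluster | unreachable_nodes
```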

Solution

  • I've never had to do this myself, but probably the only thing left for you to do is to assassinate the endpoint. This was made into a nodetool command (nodetool assassinate) in Cassandra 2.2. But prior to that version, the only way to do it is via JMX. Here's a Gist with detailed instructions (instructions and code by Justen Walker).

    Prerequisites

    1. Log on to a live node in the existing cluster

    2. Download JMX Term

    wget

    $ wget -q -O jmxterm.jar http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar

    or curl

    $ curl -sL -o jmxterm.jar http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar
    
    3. Run jmxterm
    $ java -jar ./jmxterm.jar
    Welcome to JMX terminal. Type "help" for available commands.
    $>
    

    Assassinate node

    Example bad node: 10.0.0.100

    • Connect to the local cluster
    • Select the Gossiper MBean
    • Run unsafeAssassinateEndpoint with the IP of the bad node

    $>open localhost:7199
    #Connection to localhost:7199 is opened 
    
    $>bean org.apache.cassandra.net:type=Gossiper
    #bean is set to org.apache.cassandra.net:type=Gossiper
    
    $>run unsafeAssassinateEndpoint 10.0.0.100
    #calling operation unsafeAssassinateEndpoint of mbean org.apache.cassandra.net:type=Gossiper
    #operation returns: null 
    
    $>quit
    

    Update 20160308:

    "I've never had to do this myself"

    Just had to do this myself. Totally looked-up and followed the steps in my own answer, too.

    Update 20220925:

    As of Cassandra 3.0, you can complete this task simply by running:

    nodetool assassinate 10.0.0.100
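Whichever route is used, it is worth confirming afterwards that the endpoint has dropped out of the ring. A small sketch, assuming the `nodetool status` column layout shown in the question (address in the second column); `node_listed` is a hypothetical helper, not a nodetool feature.

```shell
# Succeed (exit 0) if the given address appears in the Address column
# of `nodetool status` output read from stdin, fail (exit 1) otherwise.
node_listed() {
  awk -v ip="$1" '$2 == ip { found = 1 } END { exit !found }'
}

# Usage:
#   if nodetool status | node_listed 10.0.0.100; then
#     echo "10.0.0.100 is still listed"
#   fi
```

Since the question saw different views from different servers, the check is best repeated on each node.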