Tags: etcd, kops, etcdctl

etcd v2: etcd-server is healthy but etcd-events is not joining ("cluster ID mismatch" and "unmatched member while checking PeerURLs" errors)


I have a legacy Kubernetes cluster running etcd v2 with 3 masters (etcd-a, etcd-b, etcd-c). We attempted an upgrade to etcd v3, but this broke the first master (etcd-a), which was no longer able to join the cluster. After some time I was able to restore it:

  1. removed etcd-a from the etcd cluster with etcdctl member remove
  2. added a new member etcd-a1 with a clean state to the cluster using etcdctl member add
  3. started kubelet with ETCD_INITIAL_CLUSTER_STATE set to existing, then started protokube. At this point the master was able to join the cluster.
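For reference, the replacement steps above roughly correspond to the following etcdctl v2 commands. This is a sketch: the member ID of the removed etcd-a must be read from `etcdctl member list`, and the hostnames are the ones used in this cluster.

```shell
# 1. Remove the broken member (replace <etcd-a-id> with the ID shown by `etcdctl member list`)
etcdctl member remove <etcd-a-id>

# 2. Register the replacement member before starting it, advertising its peer URL
etcdctl member add etcd-a1 http://etcd-a1.internal.mydomain.com:2380

# 3. Start the new member with an empty data dir, joining the existing cluster
export ETCD_NAME=etcd-a1
export ETCD_INITIAL_CLUSTER_STATE=existing
export ETCD_INITIAL_CLUSTER="etcd-a1=http://etcd-a1.internal.mydomain.com:2380,etcd-b=http://etcd-b.internal.mydomain.com:2380,etcd-c=http://etcd-c.internal.mydomain.com:2380"
etcd
```

The important detail is that `member add` must happen on the healthy cluster before the new member starts, and the new member must start with `initial-cluster-state=existing` and a clean data directory.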

At first I thought the cluster was healthy:

/ # etcdctl member list
a4***b2: name=etcd-c peerURLs=http://etcd-c.internal.mydomain.com:2380 clientURLs=http://etcd-c.internal.mydomain.com:4001
cf***97: name=etcd-a1 peerURLs=http://etcd-a1.internal.mydomain.com:2380 clientURLs=http://etcd-a1.internal.mydomain.com:4001
d3***59: name=etcd-b peerURLs=http://etcd-b.internal.mydomain.com:2380 clientURLs=http://etcd-b.internal.mydomain.com:4001

/ # etcdctl cluster-health
member a4***b2 is healthy: got healthy result from http://etcd-c.internal.mydomain.com:4001
member cf***97 is healthy: got healthy result from http://etcd-a1.internal.mydomain.com:4001
member d3***59 is healthy: got healthy result from http://etcd-b.internal.mydomain.com:4001
cluster is healthy

Yet the etcd-events cluster is not healthy: the etcd-events pod on a1 is not running.

etcd-server-events-ip-a1       0/1     CrashLoopBackOff   430
etcd-server-events-ip-b        1/1     Running            3
etcd-server-events-ip-c        1/1     Running            0

Logs from etcd-events-a1:

flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://etcd-events-a1.internal.mydomain.com:4002
flags: recognized and used environment variable ETCD_DATA_DIR=/var/etcd/data-events
flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://etcd-events-a1.internal.mydomain.com:2381
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd-events-a1=http://etcd-events-a1.internal.mydomain.com:2381,etcd-events-b=http://etcd-events-b.internal.mydomain.com:2381,etcd-events-c=http://etcd-events-c.internal.mydomain.com:2381
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-token-etcd-events
flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:4002
flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2381
flags: recognized and used environment variable ETCD_NAME=etcd-events-a1
etcdmain: etcd Version: 2.2.1
etcdmain: Git SHA: 75f8282
etcdmain: Go Version: go1.5.1
etcdmain: Go OS/Arch: linux/amd64
etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
etcdmain: the server is already initialized as member before, starting as etcd member...
etcdmain: listening for peers on http://0.0.0.0:2381
etcdmain: listening for client requests on http://0.0.0.0:4002
netutil: resolving etcd-events-b.internal.mydomain.com:2381 to 10.15.***:2381
netutil: resolving etcd-events-a1.internal.mydomain.com:2381 to 10.15.***:2381
etcdmain: stopping listening for client requests on http://0.0.0.0:4002
etcdmain: stopping listening for peers on http://0.0.0.0:2381
etcdmain: error validating peerURLs {ClusterID:5a***b3 Members:[&{ID:a7***32 RaftAttributes:{PeerURLs:[http://etcd-events-b.internal.mydomain.com:2381]} Attributes:{Name:etcd-events-b ClientURLs:[http://etcd-events-b.internal.mydomain.com:4002]}} &{ID:cc***b3 RaftAttributes:{PeerURLs:[https://etcd-events-a.internal.mydomain.com:2381]} Attributes:{Name:etcd-events-a ClientURLs:[https://etcd-events-a.internal.mydomain.com:4002]}} &{ID:7f***2ca RaftAttributes:{PeerURLs:[http://etcd-events-c.internal.mydomain.com:2381]} Attributes:{Name:etcd-events-c ClientURLs:[http://etcd-events-c.internal.mydomain.com:4002]}}] RemovedMemberIDs:[]}: unmatched member while checking PeerURLs

# restart
...

etcdserver: restarting member eb***3a in cluster 96***07 at commit index 3
raft: eb***a3a became follower at term 12407
raft: newRaft eb***3a [peers: [], term: 12407, commit: 3, applied: 0, lastindex: 3, lastterm: 1]
etcdserver: starting server... [version: 2.2.1, cluster version: to_be_decided]
etcdserver: added local member eb***3a [http://etcd-events-a1.internal.mydomain.com:2381] to cluster 96***07
etcdserver: added member 7f***ca [http://etcd-events-c.internal.mydomain.com:2381] to cluster 96***07
rafthttp: request sent was ignored (cluster ID mismatch: remote[7f***ca]=5a***b3, local=96***07)
rafthttp: request sent was ignored (cluster ID mismatch: remote[7f***ca]=5a***3, local=96***07)
rafthttp: failed to dial 7f***ca on stream Message (cluster ID mismatch)
rafthttp: failed to dial 7f***ca on stream MsgApp v2 (cluster ID mismatch)
etcdserver: added member a7***32 [http://etcd-events-b.internal.mydomain.com:2381] to cluster 96***07
rafthttp: request sent was ignored (cluster ID mismatch: remote[a7***32]=5a***b3, local=96***07)
rafthttp: failed to dial a7***32 on stream MsgApp v2 (cluster ID mismatch)

...

rafthttp: request sent was ignored (cluster ID mismatch: remote[a7***32]=5a***b3, local=96***07)
osutil: received terminated signal, shutting down...
etcdserver: aborting publish because server is stopped

Logs from etcd-events-b:

rafthttp: streaming request ignored (cluster ID mismatch got 96***07 want 5a***b3)
rafthttp: the connection to peer cc***b3 is unhealthy

Logs from etcd-events-c:

etcdserver: failed to reach the peerURL(https://etcd-events-a.internal.mydomain.com:2381) of member cc***b3 (Get https://etcd-events-a.internal.mydomain.com:2381/version: dial tcp 10.15.131.7:2381: i/o timeout)
etcdserver: cannot get the version of member cc***b3 (Get https://etcd-events-a.internal.mydomain.com:2381/version: dial tcp 10.15.131.7:2381: i/o timeout)

From the logs I see two problems:

  1. etcd-events-a1 started with a different cluster ID (96***07) than the rest of the etcd-events cluster (5a***b3), so its raft traffic is rejected with "cluster ID mismatch".
  2. The old member etcd-events-a (cc***b3) is still registered in the cluster, with https peerURLs left over from the failed v3 upgrade, which triggers "unmatched member while checking PeerURLs".

I'm short of ideas on how to fix this. Any suggestions?

Thanks!


Solution

  • If you tried to upgrade to etcd3 and did not restart all the masters at the same time, the upgrade will fail.

    Make sure you read through https://kops.sigs.k8s.io/etcd3-migration/. I also strongly recommend using the latest possible version of kOps, as there are quite a few migration bugs fixed along the way.

    There may be multiple reasons why the cluster ID changed, but if I remember correctly, replacing members like that was never really supported, and with etcd2 your options are limited. Getting to etcd-manager and etcd v3 may be the best way to bring your cluster back into a working state.
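    If you want to attempt a repair of the etcd-events cluster in place, one possible path, assuming etcd-events-b and etcd-events-c still form a healthy quorum, is to remove the stale etcd-events-a member, register etcd-events-a1 fresh, and wipe a1's events data directory so it joins with the correct cluster ID instead of the one it bootstrapped. This is a sketch only; the member ID is the masked one from the logs and must be read from `etcdctl member list`:

    ```shell
    # Point etcdctl at a healthy etcd-events member (client port 4002 per the logs)
    ENDPOINT=http://etcd-events-b.internal.mydomain.com:4002

    # 1. Remove the stale etcd-events-a member that still advertises https peerURLs
    etcdctl --endpoints "$ENDPOINT" member remove <cc***b3-id>

    # 2. Register etcd-events-a1 so the healthy cluster expects it as a new member
    etcdctl --endpoints "$ENDPOINT" member add etcd-events-a1 http://etcd-events-a1.internal.mydomain.com:2381

    # 3. On the a1 master: wipe the data dir that holds the wrong cluster ID,
    #    then let the pod restart with ETCD_INITIAL_CLUSTER_STATE=existing
    rm -rf /var/etcd/data-events/*
    ```

    The data-dir wipe is what clears the wrong cluster ID (96***07); as long as the member was removed and re-added on the healthy side first, a1 will rejoin cluster 5a***b3 and catch up from its peers.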