docker hyperledger-fabric hyperledger docker-swarm ca

TLS handshake failed with error EOF when deploying Hyperledger Fabric on multiple hosts


I am trying to develop a package that allows users to dynamically deploy Hyperledger Fabric peers and orderers across multiple hosts.

The source code is uploaded here to clarify the question further. The description below should let you reproduce the problem I am facing.

First of all, I set up two machines on GCP, both with the environment shown in the following picture:

(https://i.sstatic.net/Q9Z4r.png)

Then I set up Docker Swarm between them, with an overlay network.

I build the containers on both machines, each with a different organization name and hostname.

Then I create a channel on the first machine, deploy a chaincode, and invoke it. So far, so good. The rest is a few routine steps to add the second host/organization to the channel: the first organization signs the updated config, which includes organization two, and updates the channel; finally, the second host/organization joins the channel using the genesis block.

However, when I look at the Docker logs, errors keep showing up on both sides:

On the second machine,

...
orderer0.130.211.248.179    | 2023-08-28 06:42:37.014 UTC 043e WARN [orderer.common.cluster.puller] probeEndpoint -> Failed connecting to {"CAs":[{"Expired":false,"Issuer":"self","Subject":"CN=fabric-ca-server,OU=Fabric,O=Hyperledger,ST=North Carolina,C=US"}],"Endpoint":"orderer0.34.81.53.133:7050"}: failed to create new connection: context deadline exceeded channel=biscechannel1
orderer0.130.211.248.179    | 2023-08-28 06:42:37.014 UTC 043f WARN [orderer.common.cluster.puller] func1 -> Received error of type 'failed to create new connection: context deadline exceeded' from orderer0.34.81.53.133:7050 channel=biscechannel1
orderer0.130.211.248.179    | 2023-08-28 06:42:37.014 UTC 0440 WARN [orderer.common.cluster.puller] connectToSomeEndpoint -> Could not connect to any endpoint of [{"CAs":[{"Expired":false,"Issuer":"self","Subject":"CN=fabric-ca-server,OU=Fabric,O=Hyperledger,ST=North Carolina,C=US"}],"Endpoint":"orderer0.34.81.53.133:7050"}] channel=biscechannel1
orderer0.130.211.248.179    | 2023-08-28 06:42:37.014 UTC 0441 ERRO [comm.tls] ClientHandshake -> Client TLS handshake failed after 5.000078214s with error: context canceled remoteaddress=10.0.1.5:7050
peer0.130.211.248.179       | 2023-08-28 06:42:37.414 UTC 02c2 WARN [peer.blocksprovider] DeliverBlocks -> Could not connect to ordering service: could not dial endpoint 'orderer0.34.81.53.133:7050': failed to create new connection: context deadline exceeded channel=biscechannel1
peer0.130.211.248.179       | 2023-08-28 06:42:37.415 UTC 02c3 WARN [peer.blocksprovider] DeliverBlocks -> Disconnected from ordering service. Attempt to re-connect in 5m4.771s channel=biscechannel1
peer0.130.211.248.179       | 2023-08-28 06:42:37.415 UTC 02c4 ERRO [comm.tls] ClientHandshake -> Client TLS handshake failed after 2.998565787s with error: context canceled remoteaddress=10.0.1.5:7050
...

On the first machine,

...
orderer0.34.81.53.133    | 2023-08-28 06:42:14.125 UTC 0abb INFO [orderer.consensus.etcdraft] hup -> 1 is starting a new election at term 2 channel=biscechannel1 node=1
orderer0.34.81.53.133    | 2023-08-28 06:42:14.125 UTC 0abc INFO [orderer.consensus.etcdraft] becomePreCandidate -> 1 became pre-candidate at term 2 channel=biscechannel1 node=1
orderer0.34.81.53.133    | 2023-08-28 06:42:14.125 UTC 0abd INFO [orderer.consensus.etcdraft] poll -> 1 received MsgPreVoteResp from 1 at term 2 channel=biscechannel1 node=1
orderer0.34.81.53.133    | 2023-08-28 06:42:14.125 UTC 0abe INFO [orderer.consensus.etcdraft] campaign -> 1 [logterm: 2, index: 8] sent MsgPreVote request to 2 at term 2 channel=biscechannel1 node=1
orderer0.34.81.53.133    | 2023-08-28 06:42:14.125 UTC 0abf ERRO [orderer.consensus.etcdraft] logSendFailure -> Failed to send StepRequest to 2, because: EOF channel=biscechannel1 node=1
orderer0.34.81.53.133    | 2023-08-28 06:42:17.008 UTC 0ac0 ERRO [core.comm] ServerHandshake -> Server TLS handshake failed in 4.99736537s with error EOF server=Orderer remoteaddress=10.0.1.11:54756
...

It seems like they can't reach each other, but when I try to ping a container on a machine from a container on the other machine, it actually works.
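
A plain ping sends small packets, so it can succeed even when larger packets (such as the certificate-carrying messages of a TLS handshake) are dropped on the way. A minimal sketch for checking whether reachability depends on packet size is below. It only builds the ping commands; it assumes a Linux iputils `ping` that supports `-M do` (set the don't-fragment bit), which a BusyBox-based container image may not provide, and the candidate MTU values are illustrative guesses rather than measured values.

```python
import shlex
from typing import List

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ping_payload_for_mtu(mtu: int) -> int:
    """ICMP payload size that makes the whole IPv4 packet exactly `mtu` bytes."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def df_ping_command(host: str, mtu: int, count: int = 3) -> List[str]:
    """Build a ping command that sends `mtu`-byte packets with fragmentation forbidden."""
    return [
        "ping", "-M", "do",      # iputils option: set DF, never fragment
        "-c", str(count),
        "-s", str(ping_payload_for_mtu(mtu)),
        host,
    ]


if __name__ == "__main__":
    # Hostname taken from the logs above; the MTU candidates are assumptions.
    for mtu in (576, 1410, 1450):
        print(shlex.join(df_ping_command("orderer0.34.81.53.133", mtu)))
```

If the small sizes get replies while the larger ones time out or report that the message is too long, the problem is more likely packet size than certificates.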

After searching this site and the Hyperledger Discord channel, I found that most people solved their TLS handshake problems by adjusting their CA. But if this were a CA problem, in my limited experience, I would expect messages like tls: bad certificate or x509: certificate signed by unknown authority in the logs, whereas they only say TLS handshake failed with error EOF. I know that doesn't rule out a CA problem, but I can't find what is going wrong with or without this assumption.

I recorded a video to demonstrate the whole thing here.

Please help or try to give some ideas how to solve this. Thanks a lot in advance.


Solution

  • I've just found a solution. The MTU (Maximum Transmission Unit) of a default GCP VPC network is 1460, which appears to be too small for communication between Fabric peers and orderers across machines under Docker Swarm. After I increased the value to 8896, the handshakes rarely fail.
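
A likely explanation, sketched as arithmetic: Docker Swarm overlay networks encapsulate container traffic in VXLAN, which adds about 50 bytes per packet. The sketch below assumes the overlay network uses Docker's usual default MTU of 1450 (I did not confirm this on the machines above); the 1460 and 8896 values are the VPC MTUs mentioned in the answer.

```python
# VXLAN encapsulation overhead counted against the host MTU:
# inner Ethernet (14) + VXLAN (8) + UDP (8) + outer IPv4 (20) bytes.
VXLAN_OVERHEAD = 14 + 8 + 8 + 20  # = 50

# Assumption: Docker's usual default MTU for overlay networks.
ASSUMED_OVERLAY_MTU = 1450


def max_overlay_mtu(host_mtu: int) -> int:
    """Largest overlay MTU whose encapsulated packets still fit in host_mtu."""
    return host_mtu - VXLAN_OVERHEAD


def fits(overlay_mtu: int, host_mtu: int) -> bool:
    """True if a full-size overlay packet fits on the host network once encapsulated."""
    return overlay_mtu + VXLAN_OVERHEAD <= host_mtu


if __name__ == "__main__":
    for host_mtu in (1460, 8896):  # VPC MTU before and after the change
        print(
            f"host MTU {host_mtu}: "
            f"max overlay MTU {max_overlay_mtu(host_mtu)}, "
            f"assumed overlay MTU fits: {fits(ASSUMED_OVERLAY_MTU, host_mtu)}"
        )
```

Under these assumptions, a full-size overlay packet becomes 1500 bytes on the wire, which exceeds a 1460-byte VPC MTU but fits easily within 8896. That would be consistent with small pings getting through while handshake messages carrying certificates are lost. An alternative that may also work is to keep the VPC MTU and create the overlay network with a lower MTU (at most 1410 here) through Docker's `com.docker.network.driver.mtu` driver option.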