apache-kafka

Kafka won't rejoin cluster after Broken Pipe error


I have a Kafka cluster with 3 brokers and 3 ZooKeeper nodes. According to the Kafka server.log, a broker intermittently encounters a broken pipe error for an unknown reason. After the error clears, instead of rejoining the cluster, the broker stays out of it and becomes its own leader (since it shrinks the ISR from 3 to 1).

So far the only workaround is to restart the broker, after which it rejoins the cluster normally as a follower. But we can't keep restarting it manually every time this issue appears.

[2019-05-10 10:32:48,344] WARN Failed to send SSL Close message  (org.apache.kafka.common.network.SslTransportLayer)
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
    at org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:209)
    at org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:172)
    at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:718)
    at org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:61)
    at org.apache.kafka.common.network.Selector.doClose(Selector.java:746)
    at org.apache.kafka.common.network.Selector.close(Selector.java:734)
    at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:532)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:424)
    at kafka.network.Processor.poll(SocketServer.scala:628)
    at kafka.network.Processor.run(SocketServer.scala:545)
    at java.lang.Thread.run(Thread.java:745)
[2019-05-10 10:32:48,368] WARN Failed to send SSL Close message  (org.apache.kafka.common.network.SslTransportLayer)
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
    at org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:209)
    at org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:159)
    at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:718)
    at org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:61)
    at org.apache.kafka.common.network.Selector.doClose(Selector.java:746)
    at org.apache.kafka.common.network.Selector.close(Selector.java:734)
    at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:532)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:424)
    at kafka.network.Processor.poll(SocketServer.scala:628)
    at kafka.network.Processor.run(SocketServer.scala:545)
    at java.lang.Thread.run(Thread.java:745)
[2019-05-10 10:32:53,422] WARN Failed to send SSL Close message  (org.apache.kafka.common.network.SslTransportLayer)
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
    at org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:209)
    at org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:159)
    at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:718)
    at org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:61)
    at org.apache.kafka.common.network.Selector.doClose(Selector.java:746)
    at org.apache.kafka.common.network.Selector.close(Selector.java:734)
    at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:532)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:424)
    at kafka.network.Processor.poll(SocketServer.scala:628)
    at kafka.network.Processor.run(SocketServer.scala:545)
    at java.lang.Thread.run(Thread.java:745)
[2019-05-10 10:32:56,976] INFO [Partition CS_NL_CUSTOMER_ADD-1 broker=4] Shrinking ISR from 4,6 to 4 (kafka.cluster.Partition)
[2019-05-10 10:32:56,994] INFO [Partition CS_NL_CUSTOMER_ADD-1 broker=4] Cached zkVersion [24394] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-05-10 10:32:56,994] INFO [Partition _confluent-controlcenter-5-1-0-1-MetricsAggregateStore-changelog-3 broker=4] Shrinking ISR from 4,6 to 4 (kafka.cluster.Partition)
[2019-05-10 10:32:57,023] INFO [Partition _confluent-controlcenter-5-1-0-1-MetricsAggregateStore-changelog-3 broker=4] Cached zkVersion [3724] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-05-10 10:32:57,023] INFO [Partition TEST_3_PART-2 broker=4] Shrinking ISR from 4,6 to 4 (kafka.cluster.Partition)
[2019-05-10 10:32:57,033] INFO [Partition TEST_3_PART-2 broker=4] Cached zkVersion [3300] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

As far as I understand Kafka, isn't a broker supposed to rejoin the cluster on its own whenever possible? Why doesn't that happen after the broken pipe error? Also, any idea what causes the broken pipe error? Is it a network issue?


Solution

  • Update: The solution is to disable vMotion on the VM where Kafka is deployed.

    It seems that whenever vMotion performs a live migration, Kafka is affected and a broken pipe error results. According to this link, Confluent also recommends disabling vMotion, as it can cause a cluster outage.

    After disabling vMotion, my Kafka cluster has not encountered any broken pipe errors again. This still doesn't answer why the broker won't automatically rejoin the cluster after a broken pipe error, but at least it resolves the broken pipe error itself.
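
Since the rejoin question is still open, one way to avoid spotting the stuck state by hand is to watch server.log for the two kafka.cluster.Partition messages shown in the question. The following is only a minimal sketch, assuming the log lines keep the exact format of the excerpt above; the function names, the `threshold` parameter, and the `SAMPLE` list are hypothetical and are not part of Kafka.

```python
import re
from collections import defaultdict

# Patterns modeled on the kafka.cluster.Partition lines in the excerpt above.
# Assumption: other Kafka versions may word these messages differently.
SHRINK_RE = re.compile(
    r"\[Partition (?P<partition>\S+) broker=(?P<broker>\d+)\] "
    r"Shrinking ISR from (?P<old>[\d,]+) to (?P<new>[\d,]+)"
)
STALE_RE = re.compile(
    r"\[Partition (?P<partition>\S+) broker=(?P<broker>\d+)\] "
    r"Cached zkVersion \[(?P<version>\d+)\] not equal to that in zookeeper"
)


def find_isr_shrinks(lines):
    """Return (partition, old_isr, new_isr) for every 'Shrinking ISR' line."""
    shrinks = []
    for line in lines:
        match = SHRINK_RE.search(line)
        if match:
            shrinks.append((
                match.group("partition"),
                match.group("old").split(","),
                match.group("new").split(","),
            ))
    return shrinks


def find_stale_partitions(lines, threshold=1):
    """Count skipped ISR updates per partition; keep those at or above threshold."""
    counts = defaultdict(int)
    for line in lines:
        match = STALE_RE.search(line)
        if match:
            counts[match.group("partition")] += 1
    return {p: n for p, n in counts.items() if n >= threshold}


# Lines copied from the log excerpt in the question.
SAMPLE = [
    "[2019-05-10 10:32:56,976] INFO [Partition CS_NL_CUSTOMER_ADD-1 broker=4] Shrinking ISR from 4,6 to 4 (kafka.cluster.Partition)",
    "[2019-05-10 10:32:56,994] INFO [Partition CS_NL_CUSTOMER_ADD-1 broker=4] Cached zkVersion [24394] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)",
    "[2019-05-10 10:32:57,023] INFO [Partition TEST_3_PART-2 broker=4] Shrinking ISR from 4,6 to 4 (kafka.cluster.Partition)",
    "[2019-05-10 10:32:57,033] INFO [Partition TEST_3_PART-2 broker=4] Cached zkVersion [3300] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)",
]
```

To use it, pass any iterable of log lines, for example an open file handle on server.log, to `find_stale_partitions`. A partition that keeps reappearing in the result suggests the broker is still skipping ISR updates, which could be used to raise an alert before anyone has to notice the problem manually. This only detects the symptom; it does not explain or fix the failure to rejoin.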