One of our Kafka brokers had a very high load average (about 8 on average) on an 8-core machine. Although that should still be acceptable, our cluster was facing problems and producers were failing to flush messages at the usual pace.
Upon further investigation, I found that my Java process was spending almost 99.99% of its time waiting for I/O, and as of now I believe this is the problem.
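For anyone who wants to reproduce the measurement, here is a minimal Python sketch that samples iowait and per-disk activity (it assumes the psutil library is installed; command-line tools such as iostat -x or pidstat -d report the same figures):

```python
# Minimal sketch: sample system-wide CPU iowait and per-disk activity on Linux.
# Assumes psutil is installed (pip install psutil).
import psutil

def sample(interval=5):
    while True:
        # Percentage of CPU time spent waiting for I/O over the interval
        # (cpu_times_percent blocks for `interval` seconds).
        cpu = psutil.cpu_times_percent(interval=interval)
        disks = psutil.disk_io_counters(perdisk=True)
        print(f"iowait: {cpu.iowait:.1f}%")
        for name, c in disks.items():
            # Cumulative counters; compare successive samples to see the rate.
            print(f"  {name}: reads={c.read_count} writes={c.write_count} "
                  f"busy_ms={c.busy_time}")

if __name__ == "__main__":
    sample()
```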
Note that this happened even when the incoming load was relatively low (around 100-150 Kbps); I have seen the broker perform perfectly well with 2 Mbps of data coming into the cluster.
I am not sure if this problem is caused by Kafka itself; I am assuming it is not, because all the other brokers worked fine during this time and our data is evenly distributed across the 5 brokers.
Please assist me in finding the root cause of the problem. Where should I look to find the problem? Are there any other tools that can help me debug this problem?
We are using a 1 TB EBS volume mounted on an m5.2xlarge machine.
Please feel free to ask any questions.
Answering my own question after figuring out the problem.
It turns out that the real problem was with the way the st1 HDD volume type works, rather than with Kafka or GC.
The st1 HDD volume type is optimized for workloads involving large, sequential I/O and performs very poorly with small random I/O (you can read more about it here). It should have worked fine for Kafka alone, but we were also writing the Kafka application logs to the same HDD, which added a lot of extra read/write I/O and depleted our burst credits very quickly during peak hours. The cluster worked fine as long as burst credits were available, and performance dropped once they were depleted.
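If you want to confirm the same behaviour on your own volumes, the BurstBalance metric in CloudWatch shows exactly when the credits run out. Here is a minimal boto3 sketch; the region and volume ID are placeholders, not values from our setup:

```python
# Minimal sketch: pull the EBS BurstBalance metric (percent of burst credits
# remaining) for one volume over the last few hours.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # adjust region

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder
    StartTime=datetime.now(timezone.utc) - timedelta(hours=6),
    EndTime=datetime.now(timezone.utc),
    Period=300,                # 5-minute datapoints
    Statistics=["Minimum"],    # the low points are what matter here
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Minimum"]:.1f}%')
```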
There are several solutions to this problem: move the application logs (and any other source of small random I/O) off the st1 volume so the drive only handles Kafka's sequential data I/O, switch the data volume to an SSD-backed type such as gp2/gp3 that handles small random I/O well, or increase the st1 volume size, since its baseline and burst throughput scale with the volume size.
This article helped me a lot to figure out the problem.
Thanks.