Tags: amazon-web-services, server, io, apache-kafka, disk-io

Why is IO wait at 99.99% even though disk reads and writes seem to be very small?


One of our Kafka brokers had a very high load average (about 8 on average) on an 8-core machine. Although that by itself should be okay, our cluster was still facing problems and producers were failing to flush messages at the usual pace.

Upon further investigation, I found that my java process was spending almost 99.99% of its time waiting for IO, and as of now I believe this is the problem.

Mind that this happened even when the incoming load was relatively low (around 100-150 Kbps); I have seen the cluster perform perfectly even with 2 Mbps of data coming in.

I am not sure whether this problem is caused by Kafka. I am assuming it is not, because all the other brokers worked fine during this time and our data is evenly divided among the 5 brokers.

Please assist me in finding the root cause of the problem. Where should I look, and are there any other tools that can help me debug this?
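If it helps, this is the kind of per-process sampling I can run in addition to iotop. It is only a minimal sketch, assuming a Linux host and Python 3; the pid below is just a placeholder for the broker's java process:

```python
# Minimal sketch: sample /proc/<pid>/io to see how many bytes a process
# actually reads/writes from storage over an interval (Linux only).
import time

PID = 12345            # placeholder: pid of the Kafka broker's java process
INTERVAL_SECONDS = 10

def storage_io_bytes(pid):
    """Return (read_bytes, write_bytes) hitting the storage layer for pid."""
    stats = {}
    with open(f"/proc/{pid}/io") as f:
        for line in f:
            key, value = line.split(":")
            stats[key.strip()] = int(value.strip())
    return stats["read_bytes"], stats["write_bytes"]

r0, w0 = storage_io_bytes(PID)
time.sleep(INTERVAL_SECONDS)
r1, w1 = storage_io_bytes(PID)

print(f"read : {(r1 - r0) / INTERVAL_SECONDS / 1024:.1f} KiB/s")
print(f"write: {(w1 - w0) / INTERVAL_SECONDS / 1024:.1f} KiB/s")
```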

We are using a 1 TB EBS volume mounted on an m5.2xlarge machine.

Please feel free to ask any questions.

iotop snapshot

GC logs snapshot


Solution

  • Answering my own question after figuring out the problem.

    It turns out that the real problem was the way the st1 HDD volume type works, rather than Kafka or GC.

    The st1 HDD volume type is optimized for workloads involving large, sequential I/O and performs very poorly with small random IOs. You can read more about it here. It should have worked fine for Kafka alone, but we were also writing Kafka application logs to the same HDD, which added a lot of small read/write IOs and depleted our burst credits very quickly during peak time. The cluster worked fine as long as burst credits were available, and performance dropped once the credits were depleted.
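    To make the burst-credit behaviour concrete, here is a rough back-of-the-envelope sketch. The 40 MiB/s-per-TiB baseline and 250 MiB/s-per-TiB burst figures come from the AWS EBS documentation; the bucket size and the sustained-load number below are assumptions I plugged in purely for illustration, so treat the output as an estimate:

```python
# Back-of-the-envelope sketch of st1 throughput behaviour for a 1 TiB volume.
# Baseline (40 MiB/s per TiB) and burst (250 MiB/s per TiB) come from the AWS
# EBS docs; the credit-bucket size below is an assumption for illustration.
TIB_MIB = 1024 * 1024                         # MiB in one TiB

volume_tib = 1.0                              # our volume size
baseline_mib_s = min(40 * volume_tib, 500)    # credit refill rate
burst_mib_s = min(250 * volume_tib, 500)      # max throughput while credits last
bucket_mib = 1.0 * volume_tib * TIB_MIB       # assumed credit bucket size

sustained_load_mib_s = 60                     # hypothetical sustained load on the drive

if sustained_load_mib_s <= baseline_mib_s:
    print("Load is under baseline: credits never deplete.")
else:
    drain_rate = sustained_load_mib_s - baseline_mib_s
    hours = bucket_mib / drain_rate / 3600
    print(f"Baseline {baseline_mib_s:.0f} MiB/s, burst {burst_mib_s:.0f} MiB/s")
    print(f"At {sustained_load_mib_s} MiB/s sustained, a full bucket lasts ~{hours:.1f} h,")
    print("after which throughput drops back to the baseline rate.")
```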

    There are several solutions to this problem:

    1. First, remove any external apps adding IO load to the st1 drive, as it is not meant for those kinds of small random IOs.
    2. Increase the number of st1 drives working in parallel to divide the load. This is easy to do with Kafka, as it lets us keep data in multiple directories on different drives (set log.dirs in server.properties to a comma-separated list of directories). But only new topics will be spread out, because partitions are assigned to directories when a topic is created; see the sketch after this list for a quick way to check the spread.
    3. Use gp2 SSD drives, as they handle both kinds of load well, but they are more expensive.
    4. Use larger st1 drives sized for your use case, as both the baseline throughput and the burst credits depend on the size of the disk. READ HERE
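    As a quick way to verify that partitions actually end up spread across the drives after adding more log.dirs (point 2), a minimal sketch like the one below can count partition directories per log dir. The mount paths are placeholders for however the st1 volumes are mounted:

```python
# Quick check of how partition directories are spread across Kafka log.dirs.
# Kafka stores each partition as a <topic>-<partition> directory inside a log dir.
import os
from collections import Counter

LOG_DIRS = ["/mnt/st1-a/kafka-logs", "/mnt/st1-b/kafka-logs"]  # placeholder paths

counts = Counter()
for log_dir in LOG_DIRS:
    partitions = [d for d in os.listdir(log_dir)
                  if os.path.isdir(os.path.join(log_dir, d))]
    counts[log_dir] = len(partitions)

for log_dir, n in counts.items():
    print(f"{log_dir}: {n} partition directories")
```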

    This article helped me a lot in figuring out the problem.

    Thanks.