performance apache-kafka kafka-producer-api

Kafka: is it better to have a lot of small messages or fewer, but bigger ones?


There is a microservice that receives batches of messages from the outside and pushes them to Kafka. Each message is sent separately, so for each batch I send around 1000 messages of 100 bytes each. The messages seem to take much more space internally, because the free space on the disk is going down much faster than I expected.

I'm thinking about changing the producer logic so that it puts the whole batch into one message (the consumer would then split it by itself). But I haven't found any information about space or performance issues with many small messages, nor any guidelines about the balance between size and count. And I don't know Kafka well enough to draw my own conclusion.

Thank you.


Solution

  • The producer will, by itself, batch messages that are destined for the same partition, in order to avoid unnecessary requests.

    The producer does this thanks to its background thread. The original diagram showed it batching 3 messages per partition before sending them.
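To make the batching idea concrete, here is a toy model in Python. It is only a sketch, not Kafka's actual implementation: the names ToyAccumulator, append and flush are hypothetical, and the limit here counts records, whereas the real producer's batch.size setting is expressed in bytes and works together with linger.ms.

```python
from collections import defaultdict


class ToyAccumulator:
    """Toy model of per-partition batching; not Kafka's real code."""

    def __init__(self, batch_size):
        self.batch_size = batch_size      # max records per batch (the real setting is in bytes)
        self.pending = defaultdict(list)  # partition -> records waiting to be sent
        self.requests = []                # (partition, batch) pairs "sent" so far

    def append(self, partition, record):
        batch = self.pending[partition]
        batch.append(record)
        if len(batch) >= self.batch_size:  # batch is full: send it as one request
            self.flush(partition)

    def flush(self, partition):
        batch = self.pending.pop(partition, [])
        if batch:
            self.requests.append((partition, batch))


acc = ToyAccumulator(batch_size=3)
for i in range(9):
    acc.append(partition=i % 3, record=f"msg-{i}")

# 9 records spread over 3 partitions end up in 3 requests instead of 9
print(len(acc.requests))
```

The point of the sketch is only that the number of requests depends on the batch limit, not on how many individual send calls the application makes.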

    If you also set compression on the producer side (compression.type), it will compress the batched messages before sending them over the wire; gzip, snappy, lz4 and, since Kafka 2.1, zstd are the valid codecs. This property can also be set on the broker side, in which case the messages are sent uncompressed by the producer and compressed by the broker.

    Whether you prefer a slower producer (compression costs CPU time) or a bigger load on the wire depends on your network capacity. Note that a high compression level on big messages may noticeably affect your overall performance.
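As a rough illustration of why compressing many small messages together pays off, the following sketch compares compressing records one by one with compressing them as a single block. It uses Python's standard gzip module as a stand-in for the producer's codec, and the record layout (id, status, source) is made up for the example.

```python
import gzip
import json

# 1000 small, similar records, roughly the situation described in the question
records = [
    json.dumps({"id": i, "status": "ok", "source": "svc-a"}).encode("utf-8")
    for i in range(1000)
]

raw_total = sum(len(r) for r in records)

# Each record compressed on its own: every record pays the gzip framing
# overhead and has little internal repetition to exploit.
individual_total = sum(len(gzip.compress(r)) for r in records)

# All records compressed together: the repeated field names compress well.
batched_total = len(gzip.compress(b"\n".join(records)))

print(raw_total, individual_total, batched_total)
```

The exact numbers depend on the data, but for small, similar records the batched figure should come out well below the other two.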

    Anyway, I believe the big/small message problem hurts the consumer side a lot more. Sending messages to Kafka is easy and fast (sending is asynchronous by default, so the producer won't be too busy). But on the consumer side, you'll have to look at the way you are processing the messages:


    1. One Consumer-Worker

    Here you couple consuming with processing. This is the simplest way: the consumer runs in its own thread, reads a Kafka message and processes it, then continues the loop.

    2. One Consumer - Many workers

    Here you decouple consuming and processing. In most cases, reading from Kafka is faster than processing the message. It is just physics. In this approach, one consumer feeds many separate worker threads that share the processing load.


    More info about this in the KafkaConsumer Javadoc, in the section on multi-threaded processing just above the Constructors area.
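The second model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real Kafka client: fake_poll and process are hypothetical stand-ins for the consumer's poll() call and for your own per-record work.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_poll(batch_no, size=5):
    """Hypothetical stand-in for consumer.poll(): returns a list of records."""
    return [f"batch{batch_no}-rec{i}" for i in range(size)]


def process(record):
    """Hypothetical stand-in for slow per-record work."""
    time.sleep(0.01)
    return record.upper()


with ThreadPoolExecutor(max_workers=4) as workers:
    futures = []
    for batch_no in range(3):              # the polling loop itself stays fast
        for record in fake_poll(batch_no):
            # hand the record to the pool instead of processing it inline
            futures.append(workers.submit(process, record))
    results = [f.result() for f in futures]  # wait for all workers to finish

print(len(results))
```

Note that once processing runs in other threads, deciding when it is safe to commit an offset becomes your responsibility, which is the main cost of this design.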

    Why do I explain this? Well, if your messages are too big and you choose the first option, your consumer may not call poll() again within the poll timeout interval (max.poll.interval.ms), so the group will rebalance continuously. If your messages are big (and take some time to be processed), it is better to implement the second option, as the consumer thread keeps calling poll() on time and does not fall into rebalances.

    If the messages are too big and too many, you may have to start thinking about structures that can buffer the messages in memory. Pools, deques and queues, for example, are different options to accomplish this.
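One possible shape for such a buffer is a bounded queue between the polling thread and the workers, so that memory use stays capped. The sketch below uses Python's standard queue module; the tiny maxsize and the record names are only for illustration.

```python
import queue

# A bounded buffer between the polling thread and the workers.
buffer = queue.Queue(maxsize=2)

buffer.put_nowait("record-1")
buffer.put_nowait("record-2")

try:
    buffer.put_nowait("record-3")  # the buffer is already full
    accepted = True
except queue.Full:
    accepted = False               # signal to stop fetching more for now

print(accepted, buffer.qsize())
```

When the buffer is full, the polling side has to back off. With the Java consumer one option is to pause() the assigned partitions and resume() them later, so that poll() keeps being called without fetching new records.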

    You may also increase the poll timeout interval. This makes dead or stuck consumers harder to notice, so I don't really recommend it.
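For reference, these are the consumer settings involved, written as a plain Python dict that uses the Java client's property names. The values are illustrative assumptions, not recommendations; other client libraries may spell these options differently or lack some of them.

```python
# Illustrative values only; property names are those of the Java consumer.
consumer_config = {
    # Maximum allowed gap between two poll() calls before the consumer is
    # considered failed and its partitions are reassigned (Java default: 300000).
    "max.poll.interval.ms": 600_000,
    # Maximum number of records returned by a single poll() (Java default: 500).
    "max.poll.records": 100,
}

print(consumer_config)
```

Lowering max.poll.records is an alternative to raising the interval: each poll() then hands back less work, so the loop gets back to poll() sooner.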


    So my answer would be: it depends, basically on your network capacity, your required latency and your processing capacity. If you are able to process big messages as fast as smaller ones, then I wouldn't care much.

    Maybe if you need to filter and reprocess older messages I'd recommend partitioning the topics and sending smaller messages, but that's only one use case.