I am using Karafka to read from a topic and call an external service. Each call to the external service takes roughly 300ms, so with 3 consumers (3 pods in k8s) running in the consumer group, I expect to achieve about 10 events per second. I see log lines like the ones below, which also confirm the ~300ms expectation for processing each individual event. However, the overall throughput doesn't add up: each Karafka process seems stuck for a long time between processing two batches of events.
The following instrumentation around the consume method implies that the consumer code itself is not taking the extra time:
https://github.com/karafka/karafka/blob/master/lib/karafka/backends/inline.rb#L12
INFO Inline processing of topic production.events with 8 messages took 2571 ms
INFO 8 messages on production.events topic delegated to xyz
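For context, the consume logic is essentially one synchronous external call per message, roughly like the sketch below. The class and service names are placeholders, and the params_batch/payload API assumes Karafka 1.x:

```ruby
# Sketch of the consumer shape assumed in this question (Karafka 1.x style).
class EventsConsumer < Karafka::BaseConsumer
  def consume
    # Each message triggers one ~300ms synchronous call to the external service,
    # so a batch of 8 messages should take roughly 2.4s inside #consume.
    params_batch.each do |message|
      ExternalService.call(message.payload) # ExternalService is a placeholder
    end
  end
end
```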
However, I notice two things:
When I tail logs on the 3 pods, only one of the 3 pods seems to emit logs at a time. This does not make sense to me, since all partitions have enough events and each consumer should be able to consume in parallel.
Though the above message shows roughly 321ms (2571/8) per event, in reality the logs stall for a long duration between the processing of two batches. I am curious where that time is going.
====== Edit:
There is some skew in the distribution of data across brokers, as we recently expanded from 3 brokers to a total of 6. However, none of the brokers is under CPU or disk pressure. This is a new cluster, and barely 4-5% CPU is used at peak times.
Our data is evenly distributed across the 3 partitions; I say this because the last offset is roughly the same for each partition:
Partition | First Offset | Last Offset | Size | Leader Node | Replica Nodes | In-sync Replica Nodes | Offline Replica Nodes | Preferred Leader | Under-replicated
---|---|---|---|---|---|---|---|---|---
0 | 2174152 | 3567554 | 1393402 | 5 | 5,4,3 | 3,4,5 | | Yes | No
1 | 2172222 | 3566886 | 1394664 | 4 | 4,5,6 | 4,5,6 | | Yes | No
2 | 2172110 | 3564992 | 1392882 | 1 | 1,6,4 | 1,4,6 | | Yes | No
However, I do see that one consumer perpetually lags behind the other two. The following table shows the lag for my consumers; there is one consumer process for each partition:
Partition | First Offset | Last Offset | Consumer Offset | Lag |
---|---|---|---|---|
0 | 2174152 | 3566320 | 2676120 | 890200 |
1 | 2172222 | 3565605 | 3124649 | 440956 |
2 | 2172110 | 3563762 | 3185587 | 378175 |
Combined lag | | | | 1709331
Here is a screenshot of the logs from all 3 consumers. You can notice the big difference between the time spent in each invocation of the consume function and the interval between two adjacent invocations. Basically, I want to explain and/or reduce that waiting time. There are 100k+ events in this topic, and my dummy Karafka applications are able to retrieve them quickly, so the Kafka brokers are not the issue.
Update after setting max_wait_time to 1 second (previously 5 seconds)
It seems the issue is resolved after reducing this wait config. Now the gap between two consecutive log lines is roughly equal to the time spent in consume:
2021-06-24 13:43:23.425 Inline processing of topic x with 7 messages took 2047 ms
2021-06-24 13:43:27.787 Inline processing of topic x with 11 messages took 3347 ms
2021-06-24 13:43:31.144 Inline processing of topic x with 11 messages took 3344 ms
2021-06-24 13:43:34.207 Inline processing of topic x with 10 messages took 3049 ms
2021-06-24 13:43:37.606 Inline processing of topic x with 11 messages took 3388 ms
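For reference, here is roughly what that change looks like in my setup. This is only a sketch of the karafka.rb setup block; the broker address and client_id are placeholders, and the option names assume the Karafka 1.x configuration layout:

```ruby
# karafka.rb (sketch)
class KarafkaApp < Karafka::App
  setup do |config|
    config.client_id = 'events_processor'                 # placeholder
    config.kafka.seed_brokers = %w[kafka://broker1:9092]  # placeholder
    # Previously 5: the client would wait up to 5s trying to fill a batch
    # before handing it to #consume, which showed up as idle gaps between batches.
    config.kafka.max_wait_time = 1 # seconds
  end
end
```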
====== Answer:
There are a couple of problems you may be facing. It is a bit of guesswork on my side without more details, but let's give it a shot.
Are you sure you're evenly distributing data across partitions? Maybe most of the messages end up on a single partition, so one consumer is doing most of the work?
What you wrote here:
INFO Inline processing of topic production.events with 8 messages took 2571 ms
This indicates that a batch of 8 messages was processed together by a single consumer, which could suggest that the data is not distributed evenly.
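If the producing side keys everything the same way (or relies on a skewed key), most records will land on one partition and one consumer will do most of the work. A rough sketch of producing with an explicit, well-distributed key, assuming the WaterDrop 1.x producer API that ships with Karafka 1.x (event and event.id are placeholders):

```ruby
# Sketch: key each record with something evenly distributed so the default
# partitioner spreads records across all partitions.
WaterDrop::SyncProducer.call(
  event.to_json,
  topic: 'production.events',
  key: event.id.to_s
)
```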
There are two performance properties that can affect your understanding of how Karafka operates: throughput and latency.
As far as I understand, all the messages are being produced without issues. You could try playing with the Karafka settings, in particular this one: https://github.com/karafka/karafka/blob/83a9a5ba417317495556c3ebb4b53f1308c80fe0/lib/karafka/setup/config.rb#L114
The logger that is being used flushes data only from time to time, so you won't see entries immediately but after a short delay. You can validate this by looking at the timestamps in the log lines.
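If it is just output buffering, making stdout synchronous usually makes the log lines appear in real time. A sketch, assuming you configure the logger inside the karafka.rb setup block:

```ruby
require 'logger'

# Sketch: inside the karafka.rb setup block.
$stdout.sync = true                   # don't buffer stdout
config.logger = Logger.new($stdout)   # log straight to unbuffered stdout
```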