I'm using Rails with the Karafka gem. One of the topics will receive huge traffic, so I added 10 partitions to it and also set Karafka::App.config.concurrency to 10. But this isn't giving me a big improvement in processing time. To scale our consumer processing further, I'm considering adding separate consumer instances in our staging environment.
Can I add separate consumer instances in Karafka to further reduce the processing time?
Karafka gem version: 2.1.11, Rails version: 7.0.5
The karafka.rb file looks something like:
```ruby
class KarafkaApp < Karafka::App
  kafka_config = Settings.kafka
  max_payload_size = 7_000_000 # Setting max payload size as 7MB

  t1_topic = kafka_config['topics']['t1_topic']
  t2_topic = kafka_config['topics']['t2_topic']

  setup do |config|
    config.kafka = {
      'bootstrap.servers': ENV['KAFKA_BROKERS_SCRAM'],
      'max.poll.interval.ms': 1_200_000
    }
    config.client_id = kafka_config['client_id']
    config.concurrency = 10
    # Recreate consumers with each batch. This will allow Rails code reload to work in
    # development mode. Otherwise the Karafka process would not be aware of code changes.
    config.consumer_persistence = !Rails.env.development?

    config.producer = ::WaterDrop::Producer.new do |producer_config|
      # Use all the settings already defined for the consumer by default
      producer_config.kafka = ::Karafka::Setup::AttributesMap.producer(config.kafka.dup)
      # Alter things you want to alter
      producer_config.max_payload_size = max_payload_size
      producer_config.kafka[:'message.max.bytes'] = max_payload_size
    end
  end

  routes.draw do
    topic t1_topic.to_sym do
      consumer C1Consumer
    end

    topic t2_topic.to_sym do
      config(partitions: Karafka::App.config.concurrency)
      consumer C2Consumer
    end
  end
end
```
Karafka author here.
Your description is missing some critical information, especially about the nature of your data and processing. Without that, it isn't easy to provide specific advice, but I'll do my best.
First, be aware that for heavily IO-bound operations, a setup like yours can be expected to yield at most around a tenfold increase per process (one unit of work per worker thread), without fully utilizing the CPU. For instance, if processing one message takes one second, the best-case scenario is 10 messages per second, even when everything works as expected. There are ways to increase this further, and I explain some of them below.
It is crucial to understand Karafka's parallel processing model and its ordering guarantees: to avoid reducing throughput, parallel data polling/receiving and parallel work distribution need to be kept in sync.
Under optimal conditions and with proper tuning, Karafka can handle up to 100k messages per second, but reaching that depends heavily on the factors below.
Another point is how Karafka manages multiple partitions within a single process. By default, it reduces the number of Kafka connections by round-robining partition data into a single internal queue. Excessive prebuffering per partition can hinder parallel processing, because a single batch taken from that queue may not contain data from enough distinct partitions for effective parallelism. This misunderstanding often arises from comparisons with Sidekiq. However, Karafka offers several mitigation strategies.
Regarding adding 10 partitions and setting Karafka::App.config.concurrency to 10 without seeing a significant improvement: this is expected if prebuffering is extensive. In such cases, Karafka cannot utilize all threads efficiently, especially when catching up on a backlog. This may change once the lag is smaller, leading to better data distribution across partitions and increased parallelism.
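To make the prebuffering effect concrete, here is a toy model (not Karafka code; `usable_threads` is a hypothetical helper) of how many worker threads a single polled batch can occupy when work is parallelized by partition:

```ruby
# Toy model: if work from one batch is distributed to worker threads by
# partition, a batch dominated by a single partition can occupy only one thread.
def usable_threads(batch, concurrency)
  [batch.map { |msg| msg[:partition] }.uniq.size, concurrency].min
end

# Catching up on a backlog with deep prebuffering: the whole batch
# comes from one partition
backlog_batch = Array.new(100) { |i| { partition: 0, offset: i } }

# Low lag: messages arrive interleaved across all 10 partitions
live_batch = Array.new(100) { |i| { partition: i % 10, offset: i } }

puts usable_threads(backlog_batch, 10) # => 1
puts usable_threads(live_batch, 10)    # => 10
```

This is why the same concurrency setting can behave very differently depending on lag and buffering.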
Adding separate consumer instances can indeed reduce processing time. Karafka supports several ways to increase parallelism and performance; which fits best depends on the nature of your work, your topic structure, and topic count. Here are some options:
Settings Tuning: The most accessible approach is adjusting the settings documented at https://karafka.io/docs/Librdkafka-Configuration/. For example, reducing fetch.message.max.bytes from its 1MB default can increase parallelism for small messages by preventing excessive data accumulation per partition.
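For instance, a sketch of such tuning in your setup block (the 64 KB value is purely illustrative; the right number depends on your message sizes):

```ruby
setup do |config|
  config.kafka = {
    'bootstrap.servers': ENV['KAFKA_BROKERS_SCRAM'],
    'max.poll.interval.ms': 1_200_000,
    # Illustrative: cap per-partition fetches at 64 KB instead of the 1 MB
    # default, so a polled batch is more likely to interleave many partitions
    'fetch.message.max.bytes': 65_536
  }
end
```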
Running a Swarm: Karafka's swarm mode, detailed at https://karafka.io/docs/Swarm-Multi-Process/, allows parallel data polling by utilizing multiple forks and connections.
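A swarm setup might look like this (a sketch; the node count is illustrative, and swarm requires a recent Karafka version, so check the linked docs for the options available in yours):

```ruby
class KarafkaApp < Karafka::App
  setup do |config|
    # Illustrative: run 4 forked nodes, each polling Kafka independently
    config.swarm.nodes = 4
    config.concurrency = 10
  end
end
```

The swarm is then started with `karafka swarm` instead of `karafka server`.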
Independent Subscription Groups: Using multiple subscription groups for different topics can improve round-robin and queuing across all your topics. More on this at https://karafka.io/docs/Routing/#multiple-subscription-groups-mode.
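For example, splitting your two topics into independent subscription groups gives each its own connection and polling loop (a sketch based on your routing; the group names are arbitrary):

```ruby
routes.draw do
  subscription_group 'group_t1' do
    topic t1_topic.to_sym do
      consumer C1Consumer
    end
  end

  subscription_group 'group_t2' do
    topic t2_topic.to_sym do
      consumer C2Consumer
    end
  end
end
```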
Virtual Partitions: These enhance data processing parallelism from a single partition, significantly boosting throughput for IO-intensive tasks. Details at https://karafka.io/docs/Pro-Virtual-Partitions/.
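With Virtual Partitions (a Pro feature), a single topic partition can be processed by multiple threads; a sketch, assuming your messages can be grouped by key without breaking your ordering requirements:

```ruby
routes.draw do
  topic t2_topic.to_sym do
    consumer C2Consumer
    # Illustrative partitioner: spread messages across worker threads by key
    virtual_partitions(
      partitioner: ->(message) { message.key }
    )
  end
end
```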
Connection Multiplexing: Karafka can establish multiple connections to the same topic from a single process, which helps in a single-topic, single-process setup. This is described further at https://karafka.io/docs/Pro-Multiplexing/.
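Multiplexing (also a Pro feature) is declared per subscription group; a sketch, noting that the exact options and availability depend on your Karafka version, so verify against the linked docs:

```ruby
routes.draw do
  subscription_group 'multiplexed' do
    # Illustrative: allow up to 3 connections to this topic from one process
    multiplexing(max: 3)

    topic t1_topic.to_sym do
      consumer C1Consumer
    end
  end
end
```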
Long-Running / Non-Blocking Jobs: This approach, which doesn't block polling while processing a topic partition, can efficiently cycle through partitions. See https://karafka.io/docs/Pro-Long-Running-Jobs/ for limitations and use cases.
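Long-Running Jobs are enabled per topic (a Pro feature); a sketch based on your routing:

```ruby
routes.draw do
  topic t2_topic.to_sym do
    consumer C2Consumer
    # Mark processing as long-running so polling is not blocked while this
    # topic partition's batch is being worked on
    long_running_job true
  end
end
```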
Karafka's concurrency model is thoroughly explained here: https://karafka.io/docs/Concurrency-and-Multithreading/.
This overview should give you a clearer path to enhancing your Karafka setup.
P.S. We have a Slack where you can get real-time answers from me and other heavy Karafka users. People run it at the scale of hundreds of thousands of messages per second, and it works well when properly configured.
P.S. 2: If you plan to bet your business on Karafka at some point, it may be worth investing in the Pro offering, which includes application-specific support and architectural consultation, plus a lot of useful features.