I'm working on a project that uses Phobos to consume messages from Kafka. As part of migrating the project to new infrastructure, I've started seeing a significant increase in timeouts when Phobos makes API calls to Kafka. It doesn't happen on every call, but it happens often. I'm troubleshooting with infrastructure engineers to understand why these calls take so much longer on the new infrastructure.
In the meantime, I'm looking for a way to increase the timeouts for the API requests made by the Phobos/ruby-kafka library. Some example requests that are occurring:
Sending join_group API request
[join_group] Waiting for response
[join_group] Timed out while waiting for response
There are properties for session_timeout and connect_timeout, but those seem to relate only to the connection to Kafka itself. The individual API calls made while the project is running time out significantly sooner than these values.
ex-contributor to ruby-kafka here.
First of all, please keep in mind that ruby-kafka is no longer supported or maintained. This project has been superseded by https://github.com/appsignal/rdkafka-ruby/
Being unmaintained means you may have run into an edge case or a bug in the library itself. Aside from that, this particular timeout usually occurs with remote clusters, where latency is much higher. The default socket_timeout for ruby-kafka was 20 seconds, but you can try raising it, along with connect_timeout, to a higher value.
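As a rough sketch of raising both values, assuming Phobos passes the `kafka` section of its configuration through to ruby-kafka: the helper `with_kafka_timeouts`, the 30/60 second values, and the sample base config are hypothetical and only illustrate the shape of the change, so check the key names against your own Phobos config file.

```ruby
require "yaml"

# Hypothetical values in seconds; tune them to the latency you actually observe.
TIMEOUT_OVERRIDES = {
  "connect_timeout" => 30, # time allowed for opening a connection to a broker
  "socket_timeout"  => 60  # time allowed for socket reads/writes, e.g. waiting for an API response
}.freeze

# Hypothetical helper: returns a copy of a Phobos-style config hash with the
# timeout overrides merged into its "kafka" section. The input is not mutated.
def with_kafka_timeouts(config, overrides = TIMEOUT_OVERRIDES)
  kafka_section = config.fetch("kafka", {}).merge(overrides)
  config.merge("kafka" => kafka_section)
end

# Illustrative base config; your real one would come from your Phobos YAML file.
base = { "kafka" => { "client_id" => "phobos", "seed_brokers" => ["localhost:9092"] } }

# Print the resulting config as YAML so it can be compared with the existing file.
puts with_kafka_timeouts(base).to_yaml
```

In practice you would probably just edit the two keys in the config file directly; the point is only that socket_timeout, not session_timeout, is the value that seems to bound how long an individual request waits for its response.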
If that does not help, or the cluster is not a remote one, the timeouts may also indicate a permissions problem (though that is less likely).
If anything, I would recommend trying one of the other frameworks based on the C library librdkafka, like https://github.com/karafka/karafka, or trying to connect to and consume from this cluster using https://github.com/appsignal/rdkafka-ruby/ directly. You will not get much help with ruby-kafka issues if this one turns out to be caused by the library.