I just switched from ForkPool to gevent with concurrency (5) as the pool method for Celery workers running in Kubernetes pods. After the switch I've been getting a non recoverable erro in the worker:
amqp.exceptions.PreconditionFailed: (0, 0): (406) PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more
The broker logs gives basically the same message:
2021-11-01 22:26:17.251 [warning] <0.18574.1> Consumer None4 on channel 1 has timed out waiting for delivery acknowledgement. Timeout used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more
I have the CELERY_ACK_LATE
set up, but was not familiar with the necessity to set a timeout for the acknowledgement period. And that never happened before using processes. Tasks can be fairly long (60-120 seconds sometimes), but I can't find a specific setting to allow that.
I've read in another post in other forum a user who set the timeout on the broker configuration to a huge number (like 24 hours), and was also having the same problem, so that makes me think there may be something else related to the issue.
Any ideas or suggestions on how to make worker more resilient?
For future reference, it seems that the new RabbitMQ versions (+3.8) introduced a tight default for consumer_timeout
(15min I think).
The solution I found (that has also been added to Celery docs not long ago here) was to just add a large number for the consumer_timeout
in RabbitMQ.
In this question, someone mentions setting consumer_timeout to false, in a way that using a large number is not needed, but apparently there's some specifics regarding the format of the configuration for that to work.
I'm running RabbitMQ in k8s and just done something like:
rabbitmq.conf: |
consumer_timeout = 31622400000