symfony amazon-sqs reliability

Can Messenger's consume command be allowed to recover from SQS message read failures?


I'm working on a simple microservice we've put together to queue and send emails. In live environments the queue uses SQS at the moment, via the latest Symfony Messenger component (v5.2.x) and its SQS bridge.

This mostly works nicely, but occasionally (roughly every few weeks) we've seen SQS return a rogue 500 server error to the consumer/worker, which is an ECS Service running Messenger's off-the-shelf ConsumeMessagesCommand. The error causes the consumer to exit completely – not the end of the world as ECS spins up another, but it feels like we should be able to do better!

The last trace I looked at was with Messenger v5.1.5 but I don't think the Messenger code involved has changed substantively since. The error is from AmazonSqsReceiver::get() on this line and the consumer app crashes reporting a PHP Fatal error: Uncaught AsyncAws\Core\Exception\Http\ServerException. I've pasted the full trace with log timestamps at the bottom of this question.

Since ServerException implements HttpException, which is caught, as far as I can tell the code then throws a Symfony-native TransportException, passing in the original AWS exception for Messenger to handle as it sees fit. Something (I've not managed to figure out exactly what yet) then seems to re-throw the original later, leading to the fatal unhandled exception.

It feels like there may be a different behaviour that could be used instead of forcing ConsumeMessagesCommand to exit, perhaps by configuring a slightly different Receiver, or by proposing a change to how the SQS one handles this out of the box if there's agreement that something else is better for most use cases. I'm happy to attempt the latter, but my understanding of some of Messenger's classes and their intended use feels a bit tenuous for that so far. I noticed the recently added RecoverableExceptionInterface, but I don't know if using it for a Receiver like this is within the intended scope.

I've had a quick look at extending AmazonSqsReceiver to tweak only get() without maintaining a totally separate Receiver, but since properties like Connection are private this gets messy fast.

I think my ideal outcome in the error case would be either:

or

Any ideas much appreciated – on whether this is behaving as designed, and what I might do to work around it if so!

2020-11-29T09:14:04.977+02:00   [29-Nov-2020 07:14:04 UTC] PHP Fatal error: Uncaught AsyncAws\Core\Exception\Http\ServerException: HTTP 500 returned for "https://sqs.eu-west-1.amazonaws.com/".
2020-11-29T09:14:04.977+02:00   Code: InternalError
2020-11-29T09:14:04.977+02:00   Message: We encountered an internal error. Please try again.
2020-11-29T09:14:04.977+02:00   Type: Receiver
2020-11-29T09:14:04.977+02:00   Detail:
2020-11-29T09:14:04.977+02:00   in /var/www/html/vendor/async-aws/core/src/Response.php:358
2020-11-29T09:14:04.977+02:00   Stack trace:
2020-11-29T09:14:04.977+02:00   #0 /var/www/html/vendor/async-aws/core/src/Response.php(117): AsyncAws\Core\Response->getResolveStatus()
2020-11-29T09:14:04.977+02:00   #1 /var/www/html/vendor/async-aws/core/src/Result.php(63): AsyncAws\Core\Response->resolve(0.1)
2020-11-29T09:14:04.977+02:00   #2 /var/www/html/vendor/symfony/amazon-sqs-messenger/Transport/Connection.php(202): AsyncAws\Core\Result->resolve(0.1)
2020-11-29T09:14:04.977+02:00   #3 /var/www/html/vendor/symfony/amazon-sqs-messenger/Transport/Connection.php(193): Symfony\Component\Messenger\Bridge\AmazonSqs\Transport\Connection->fetchMessage()
2020-11-29T09:14:04.977+02:00   #4 /var/www/html/vendor/symfony/amazon-sqs-messenger/Transport/Connection.php(165): Symfony\Component\Messenger\Bridge\AmazonSqs\Transport\Connection->getNewMessages()
2020-11-29T09:14:04.977+02:00   #5 /var/www/html/vendor/symfony/amazon-sqs-messenger/Transport/Connection.php(152): Symfony\Component\Messenger\Bridge\AmazonSqs\Transport\Connection->getNextMessages()
2020-11-29T09:14:04.977+02:00   #6 /var/www/html/vendor/symfony/amazon-sqs-messenger/Transport/AmazonSqsReceiver.php(44): Symfony\Component\Messenger\Bridge\AmazonSqs\Transport\Connection->get()
2020-11-29T09:14:04.977+02:00   #7 /var/www/html/vendor/symfony/messenger/Worker.php(74): Symfony\Component\Messenger\Bridge\AmazonSqs\Transport\AmazonSqsReceiver->get()
2020-11-29T09:14:04.977+02:00   #8 /var/www/html/vendor/symfony/messenger/Command/ConsumeMessagesCommand.php(197): Symfony\Component\Messenger\Worker->run(Array)
2020-11-29T09:14:04.977+02:00   #9 /var/www/html/vendor/symfony/console/Command/Command.php(258): Symfony\Component\Messenger\Command\ConsumeMessagesCommand->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
2020-11-29T09:14:04.977+02:00   #10 /var/www/html/vendor/symfony/console/Application.php(916): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
2020-11-29T09:14:04.977+02:00   #11 /var/www/html/vendor/symfony/console/Application.php(264): Symfony\Component\Console\Application->doRunCommand(Object(Symfony\Component\Messenger\Command\ConsumeMessagesCommand), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
2020-11-29T09:14:04.977+02:00   #12 /var/www/html/vendor/symfony/console/Application.php(140): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
2020-11-29T09:14:04.977+02:00   #13 /var/www/html/mailer-cli.php(18): Symfony\Component\Console\Application->run()
2020-11-29T09:14:04.977+02:00   #14 {main}
2020-11-29T09:14:04.977+02:00   thrown in /var/www/html/vendor/async-aws/core/src/Response.php on line 358
2020-11-29T09:14:04.981+02:00
Script php mailer-cli.php messenger:consume -vv --time-limit=86400 handling the messenger:consume event returned with error code 1

Solution

  • It's now looking like this can be fixed by using the latest stable async-aws/sqs and async-aws/core (in particular v1.7.0 or newer of the latter), without changes to Symfony Messenger itself.

    After I tried to patch Messenger in a PR, @jderusse – who I think has worked on the above libraries – suggested that upgrading should resolve the blips, since the newer versions make use of RetryableHttpClient.

    Since the lib's standard retry strategy includes repeating failed calls that get HTTP 500 responses this seems like it should catch the edge case and is likely the best fix.

    We already had the library updates on our development branch, so will prioritise releasing the changes live to verify.

    Edit: I can confirm this sorted it with no app code changes.
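
For anyone wondering why a retry layer is enough to hide these blips, here is a minimal sketch of the general idea in plain PHP. It is not the actual AsyncAws or Symfony implementation: FakeServerException and callWithRetry are hypothetical names made up for illustration, and the retry count and backoff are assumptions rather than the library's real defaults.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for a server-side (HTTP 5xx) failure.
 * The real exception in the trace above is AsyncAws's ServerException.
 */
final class FakeServerException extends \RuntimeException
{
}

/**
 * Calls $operation and retries it when it throws FakeServerException.
 * Illustrative only: the retry count and delays are assumptions.
 *
 * @return mixed whatever $operation returns
 */
function callWithRetry(callable $operation, int $maxRetries = 3, int $baseDelayMs = 0)
{
    $attempt = 0;

    while (true) {
        try {
            return $operation();
        } catch (FakeServerException $e) {
            if ($attempt >= $maxRetries) {
                // Out of retries: let the caller see the failure.
                throw $e;
            }

            // Exponential backoff: base, 2x base, 4x base, ...
            usleep($baseDelayMs * 1000 * (2 ** $attempt));
            ++$attempt;
        }
    }
}

// Usage: a fake queue read that fails once with a 500, then succeeds.
$calls = 0;
$fetch = function () use (&$calls): array {
    if (++$calls === 1) {
        throw new FakeServerException('HTTP 500 InternalError', 500);
    }

    return ['message-1'];
};

$result = callWithRetry($fetch);
```

In this sketch the one-off failure never reaches the caller, which matches what we saw after upgrading: the worker keeps running instead of exiting on an isolated 500.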