node.jsamazon-web-servicesaws-lambdaserverless-frameworkaws-sqs-fifo

AWS SQS FIFO message doesn't seem to be retrying


I've set up a function to hit an API endpoint for a newly created entity that isn't immediately available. If the endpoint returns a status of "pending", the function throws an error. If the endpoint returns a status of "active", the function then deletes the SQS message and triggers several other microservices to do their things using SNS. The SQS queue that triggers the function has a visibility timeout of 2 minutes, and the function itself has a 1 minute timeout.

What I'm expecting to happen is that if the endpoint returns a "pending" status, and the function throws an error, then after the 2 minute visibility timeout, the message would trigger the function again. This should happen every 2 minutes until the api call returns an "active" status and the message is deleted, or until the message retention period is surpassed (currently 1 hour). This seemed like a nice serverless way to poll my newly created entity to check if it was ready for other post-processing.

What's actually happening after adding a message to the SQS queue is that the CloudWatch logs are showing that the function is throwing an error like I'd expect, but the function is only being triggered one time. I can't tell if the message is just not not visible for some reason, or if it somehow was deleted. I don't know. I'm am new to using SQS for a Lambda trigger, am I thinking about this wrong?


Solution

  • A few possible causes here:

    1. your Lambda function handler did not actually throw an exception to the Lambda runtime environment, so Lambda thought the function had successfully processed the message and the Lambda service then deleted the message from the queue (so that it would not get processed again)
    2. your SQS queue has a configured DLQ with maximum receives set to 1, so the message is delivered once, the Lambda function fails, and the message is subsequently moved to the DLQ
    3. the SQS message was re-delivered to the Lambda function and was logged but the logs were made to an earlier log stream (because this invocation was warm) and so it wasn't obvious that the Lambda function had actually been invoked multiple times with the same failed message

    To verify this all works normally, I set up a simple test with both FIFO and non-FIFO queues and configured the queues to trigger a Lambda function that simply logged the SQS message and then threw an exception. As expected, I saw the same SQS message delivered to the Lambda function every 2 minutes (which is the queue's message visibility timeout). That continued until it hit the max receive count on the SQS redrive policy (defaults to 10 attempts) at which point the failed message was correctly moved to the associated DLQ.