amazon-web-services, amazon-sqs

Is there any way to get the Receipt Handle(s) for messages in flight for an SQS queue?


So - we have a CloudWatch alarm on the age of the oldest message in an SQS queue, basically to alert us if consumers aren't doing their job/keeping up. It started going bonkers today because a message was received with an extremely long visibility timeout, and the consumer never deleted it (the process crashed) - so it's stuck in-flight.

We no longer have the receipt handle for that message, and there doesn't seem to be any way I can see to get it once it has been consumed. I know in some sense that's to protect the whole process so one consumer doesn't bork another, but it seems nuts there's no "break glass in case of emergency" here -> especially when there IS a way to change the message visibility if you somehow do have the receipt handle.
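For reference, the call I mean is ChangeMessageVisibility. A minimal sketch with boto3, assuming `sqs` is a client created with `boto3.client("sqs")` and that the consumer still holds the receipt handle (`release_message` is just an illustrative helper name, not something from our codebase or from AWS):

```python
def release_message(sqs, queue_url, receipt_handle):
    """Make an in-flight message visible again right away.

    `sqs` is assumed to be a boto3 SQS client, e.g. boto3.client("sqs").
    This only works while you still hold the receipt handle returned by
    the ReceiveMessage call that put the message in flight.
    """
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=0,  # 0 seconds: return the message to the queue now
    )
```

Without the handle there is nothing to pass as `ReceiptHandle`, which is exactly the problem.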

Is there really NO way to get the handle(s) for messages in flight? Right now I have to basically turn that alarm off for a few days, which leaves us somewhat blind to an important metric.


Solution

  • I'd love for someone to prove me wrong on this, but my answer so far...

    BLUF - it isn't possible. It's by design, whether you agree with it or not.

    Longer answer:

    The main idea here is that SQS never wants two consumers processing the same message. So by design once a consumer pulls the message, no one else can know the receipt handle of (or really anything about) that message. This prevents some hacky work-around where a second consumer tries to requeue and process that same message.

    I disagree with AWS that there isn't some kind of admin override here, but... My opinion doesn't matter much.

    The idea of a DLQ doesn't really address the main issue. Most of the time this is proposed because people assume there is a fault in the consumer that causes an infinite loop. I.e. message is placed into queue, consumer pulls (placing message in-flight), consumer fails to process message, visibility timeout is reached (placing message back into queue), consumer picks message up again and fails again.... Lather, rinse, repeat.

    In this case, after a set number of retries, the message can be placed into a DLQ, at which point someone can view the message and investigate why the consumer is failing on it, and possibly requeue it after the bug is fixed.

    This DOESN'T address the main problem in this question though - what happens when a message is received with an obscenely long visibility timeout, and the consumer fails. E.g. what if the visibility timeout is set to the 12-hour maximum that SQS allows. Of course this is a bad idea, but let's say it happens. At that point, there's literally no way to get that message out of in-flight until the timeout expires. Which in turn makes the CloudWatch metric "age of oldest message in queue" (I'd argue a very useful metric to use and alarm on) utterly useless for that whole stretch.

    Unfortunately, there's no way around this.

    Except... there is a nuclear option: purge-queue (https://docs.aws.amazon.com/cli/latest/reference/sqs/purge-queue.html). Purging the queue will remove all messages from the queue, including those in-flight. They are NOT returned for reprocessing, and you cannot select which messages to purge. However, if your queue is generally low volume, and like us you have an alarm on the oldest message in queue that's important but way too noisy because of this, it is possible to just "zero everything out".
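
A rough sketch of that "zero everything out" step with boto3, assuming `sqs` is a client created with `boto3.client("sqs")`. The helper name `purge_if_quiet` and the `max_visible` threshold are made up for illustration; the idea is only to purge when something is stuck in-flight and there is little or no visible backlog to lose.

```python
def purge_if_quiet(sqs, queue_url, max_visible=0):
    """Purge the queue only if the visible backlog is small enough to lose.

    `sqs` is assumed to be a boto3 SQS client. The counts SQS reports are
    approximate, so treat this as a safety check, not a guarantee.
    Returns True if a purge was requested, False otherwise.
    """
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",            # visible messages
            "ApproximateNumberOfMessagesNotVisible",  # in-flight messages
        ],
    )["Attributes"]
    visible = int(attrs["ApproximateNumberOfMessages"])
    in_flight = int(attrs["ApproximateNumberOfMessagesNotVisible"])

    if in_flight == 0 or visible > max_visible:
        return False  # nothing stuck, or too much real work would be lost

    sqs.purge_queue(QueueUrl=queue_url)  # deletes everything, no undo
    return True
```

Keep in mind that a purge can take up to 60 seconds to complete, and SQS only allows one purge per queue every 60 seconds.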