Tech stack - Python, Rabbitmq, Elasticsearch, Heroku
I have one application which adds content in the app based on a certain schedule and when the content is added, an email needs to be triggered to some users based on some criteria ( can be around 1million users)
The design I have decided is that first application(producer) will publish the content and then put a message in Rabbitmq(message broker) containing user id and email address for each user who is supposed to get the email, so basically adding around 1mn messages to Rabbitmq and then the email will be sent by email application(consumer)
My question is around the producer application.It is going to get a list of user ids from elasticsearch and start adding them to the queue using a for loop, but if for some reason this application goes down(like in case of new deployment), then only few users will be added to the queue and when the application comes back it will again start queueing and we might end up with duplicate emails to the same users. Is there a standard way to avoid this.Seems like a very common issue.
I am thinking of maintaining the message acknowledgement(for each user for given content)from consumer app within my producer app database.But it feels like that it might lead to records explosion in my PostgreSQL database after a few email sends.Each content can be followed by email trigger to around 1million user.
You are correct that it's a very common issue. Unfortunately, in general, when sending messages between asynchronously executing processes, you cannot guarantee exactly-once delivery of a message: you have to choose between at-most-once and at-least-once.
You can guarantee at-most-once (which trivially prevents duplicates), by having the producer application never retry a send (even if the producer fails). That may not be what you want, however.
You can guarantee at-least-once by having the consumer acknowledge to the producer that it has received and acted upon the sent message; the producer maintains state (e.g. in a database) that a given message hasn't been acknowledged and retries unacknowledged messages after some time. Note that if the consumer acknowledges after performing the actions, it's possible that the actions (i.e. the email being sent) get at least partially duplicated (consider what happens if the consumer crashes between starting to do the thing and acknowledging that it's done) and if the consumer acknowledges before performing the actions, you've made it into at-most-once delivery (consider the case where the consumer crashes after acknowledgement but before fully doing the thing).
By having the consumer maintain state containing messages its acknowledged, the consumer can deduplicate messages: if it receives a message it's already acknowledged, it acknowledges it again and doesn't do anything else. It's still at-least-once, but the rate of duplicates is substantially reduced (at some expense: every message now entails potentially 3 DB writes (producer writes message unacknowledged, consumer and producer each write acknowledgement), though those need not be to the same DB.