Tags: python, publish-subscribe, google-cloud-pubsub, websub

How can I rule out duplicate messages in Pub/Sub, in Python, without using Dataflow and without ACKing the messages?


I have a use case where I want to read messages from Pub/Sub without acknowledging them. I need help ruling out "duplicate messages", i.e. messages that remain in Pub/Sub and get redelivered because I don't ACK them.

Solutions that I have thought of:

  1. Store the pulled messages in Datastore and check whether each new message matches one already stored.
  2. Store the pulled messages in memory at runtime and check whether each incoming message is a duplicate: O(n) time and O(n) space with a list (see the sketch after this list).
  3. Store the pulled messages in a file and compare new incoming messages against the messages in the file.
  4. Use Dataflow to rule out duplicates (my least preferred option).
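
For option 2, here is roughly what I have in mind: a sketch that keys on the message_id that Pub/Sub assigns, which stays the same when an unacked message is redelivered; using a set also makes each lookup O(1) instead of an O(n) scan:

seen_ids = set()

def is_duplicate(received_message):
    # message_id is assigned by Pub/Sub and is identical across
    # redeliveries of the same message, so it works as a dedup key.
    message_id = received_message.message.message_id
    if message_id in seen_ids:
        return True
    seen_ids.add(message_id)
    return False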

As far as I can tell, Pub/Sub has no offset-like feature similar to Kafka's.

Which of these approaches would you suggest, or is there a better alternative I could use?

I am using the Python client from google-cloud-pubsub (pubsub_v1) to pull messages from Pub/Sub.

Here is the code with the logic I use to pull the data:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# project_id and subscription_name are defined elsewhere in my code.
subscription_path = subscriber.subscription_path(project_id, subscription_name)

NUM_MESSAGES = 3

# The subscriber pulls a specific number of messages.
response = subscriber.pull(subscription_path, max_messages=NUM_MESSAGES)

for received_message in response.received_messages:
    print(received_message.message.data)


Solution

  • It sounds like Pub/Sub is probably not the right tool for the job. It seems as if you are trying to use Pub/Sub as a persistent data store, which is not the intended use case. Acking is a fundamental part of the lifecycle of a Cloud Pub/Sub message: messages that remain unacked are deleted once the configured message retention period expires, and that period cannot be longer than 7 days.
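
    For completeness, acking a pulled batch with the same older-style client API used in your snippet looks like this; the acked messages are then removed from the subscription and not redelivered:

    ack_ids = [m.ack_id for m in response.received_messages]
    if ack_ids:
        # Acknowledge the batch so Pub/Sub can delete these messages.
        subscriber.acknowledge(subscription_path, ack_ids)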

    I would suggest instead that you consider using an SQL database like Cloud Spanner. You could attach a uuid to each message when it is published (or use the message_id that Pub/Sub assigns, since it is stable across redeliveries), use that as the primary key for deduplication, and transactionally insert into the database to ensure there are no duplicates.
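
    A minimal sketch of that idea, assuming a Spanner database with a messages table whose primary key is message_id (the instance, database, table, and column names here are all hypothetical). The insert raises AlreadyExists when a row with that key is already present, which is exactly the dedup check:

    from google.api_core.exceptions import AlreadyExists
    from google.cloud import spanner

    spanner_client = spanner.Client()
    # Hypothetical instance and database names.
    database = spanner_client.instance("my-instance").database("my-database")

    def store_if_new(received_message):
        """Insert the message keyed on its id; return False for duplicates."""
        try:
            with database.batch() as batch:
                batch.insert(
                    table="messages",
                    columns=("message_id", "data"),
                    values=[(received_message.message.message_id,
                             received_message.message.data)],
                )
            return True
        except AlreadyExists:
            # A row with this message_id already exists, so this
            # delivery is a duplicate.
            return False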

    I might be able to provide a better answer for you if you provide more information about what you are planning on doing with the deduplicated messages.