I have the below workflow:
The Delete API sets the resource state to Deleting, starts a background process, and returns OK to the caller. A separate job table keeps track of these background processes and their state.
When the background process finishes, the state becomes Deleted or, if there is an error, DeleteFailed. This status is reflected in the UI.
Note: Once the state is Deleting, the Delete button is disabled in the UI. If the state is DeleteFailed, the Delete button is enabled again and the caller can call the Delete API again.
Problem:
If there is an infra/DB failure during background thread execution, the resource stays in the Deleting state forever.
Solution:
Infra failure - I have added job recovery code on node startup (called from the main method) that checks whether there are jobs in the running state and, if so, reruns them. Skipping implementation details here. This is solved.
DB call failure - In this case the job completes (success or failure), but the DB record is left hanging in the Deleting state. How can this be solved?
Thoughts? Any better solution? There should be some standard way to solve this as it seems like a common problem.
I would consider the following basic things when designing such a flow:
In your particular case, if the DB failures are in the master, you can make the update an atomic transaction and detect the problem as a DB commit failure. If the failure is in a subpart and you want that subpart to own its state, then the subpart needs to handle the failure itself.
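
As a rough illustration of the atomic-transaction idea, here is a minimal sketch using Python's built-in sqlite3 module. The table names (resources, jobs), the job states (Running, Completed, Failed), and the function finish_delete are assumptions made up for this example; only Deleting, Deleted, and DeleteFailed come from the question. The point is that the job row and the resource row change together or not at all, and the caller gets an explicit signal when the write fails.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    """Create a throwaway in-memory schema (hypothetical table layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, state TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, "
        "resource_id INTEGER NOT NULL, state TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO resources VALUES (1, 'Deleting')")
    conn.execute("INSERT INTO jobs VALUES (1, 1, 'Running')")
    conn.commit()
    return conn


def finish_delete(conn: sqlite3.Connection, job_id: int,
                  resource_id: int, succeeded: bool) -> bool:
    """Record the job outcome and the resource state in one transaction.

    Returns True if the commit succeeded. Returns False if any statement
    or the commit failed; in that case the transaction is rolled back and
    neither row has changed, so the caller can retry or escalate.
    """
    job_state = "Completed" if succeeded else "Failed"
    resource_state = "Deleted" if succeeded else "DeleteFailed"
    try:
        # "with conn" commits on normal exit and rolls back on an exception.
        with conn:
            conn.execute("UPDATE jobs SET state = ? WHERE id = ?",
                         (job_state, job_id))
            conn.execute("UPDATE resources SET state = ? WHERE id = ?",
                         (resource_state, resource_id))
        return True
    except sqlite3.Error:
        return False
```

A False return is the "DB commit failure" signal: the job is still marked Running, so the same recovery path that handles infra failures can pick it up again.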
A cleanup thread that scans for stuck operations may not be a great idea if you don't have an efficient way to find the "stuck" operations. In that case, a dead-letter queue would be simpler.
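
To make the dead-letter queue idea concrete, here is a small sketch under some loud assumptions: StateUpdate, write_final_state, and drain_dead_letters are hypothetical names, the write callable stands in for whatever performs the real DB update, and an in-memory queue.Queue stands in for a durable queue. In a real system the queue would need to survive restarts and live outside the database that is failing.

```python
import queue
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class StateUpdate:
    """A final state transition that still has to reach the database."""
    resource_id: int
    new_state: str  # e.g. "Deleted" or "DeleteFailed"


def write_final_state(update: StateUpdate,
                      write: Callable[[StateUpdate], None],
                      dead_letters: "queue.Queue[StateUpdate]",
                      attempts: int = 3) -> bool:
    """Try the write a few times; park the update on the queue if all fail."""
    for _ in range(attempts):
        try:
            write(update)
            return True
        except Exception:  # broad on purpose: any write error counts here
            continue
    dead_letters.put(update)
    return False


def drain_dead_letters(write: Callable[[StateUpdate], None],
                       dead_letters: "queue.Queue[StateUpdate]") -> int:
    """Replay parked updates once; re-park the ones that still fail.

    Returns the number of updates that were applied.
    """
    applied = 0
    for _ in range(dead_letters.qsize()):
        update = dead_letters.get_nowait()
        try:
            write(update)
            applied += 1
        except Exception:
            dead_letters.put(update)
    return applied
```

With this shape, the recovery worker only looks at the parked updates, so it never has to scan the resource table for rows that have been in Deleting for too long.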