I would like the Slurm workload manager to perform some action, such as touch stopped.txt, when a job terminates, whether due to a timeout or a failure. How can this be done?
Once the job has terminated, regular users have no way to perform further actions. (Admins can use strigger or set up epilog scripts.)
For termination due to timeout, the typical course of action is to ask Slurm (with the --signal option of sbatch) to send a signal a few minutes before the job is killed, and to set up a Bash "trap" in the submission script to catch that signal.
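A minimal sketch of such a submission script is shown here. The time limit, the 300-second lead time, the handler name on_timeout, and the use of sleep as a stand-in for the real workload are all assumptions for illustration. The workload runs in the background followed by wait, because Bash only runs a trap handler once the current foreground command has returned.

```shell
#!/bin/bash
#SBATCH --time=01:00:00
# Ask Slurm to send SIGUSR1 to the batch shell (B:) about 300 s before the time limit.
#SBATCH --signal=B:USR1@300

# Handler (hypothetical name) that records that the job was about to be stopped.
on_timeout() {
    touch stopped.txt
}
trap on_timeout USR1

# `sleep 1` is a stand-in for the real workload, e.g. `srun ./my_program`.
# Running it in the background lets `wait` be interrupted by the signal.
sleep 1 &
wait
```

When the signal arrives, wait returns, the handler runs, and the script continues past wait, so any further cleanup can be placed after it.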
For termination due to failure, you can test the return code of your main program inside the submission script and act accordingly.
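As a rough sketch of that check, the wrapper below runs a command, creates stopped.txt if the command exits with a non-zero code, and passes the exit code on. The function name run_and_mark and the sleep stand-in are assumptions, not part of Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=main_job

# Run the given command; if it fails, create stopped.txt.
# The command's exit code is returned unchanged.
run_and_mark() {
    "$@"
    local status=$?
    if [ "$status" -ne 0 ]; then
        touch stopped.txt
    fi
    return "$status"
}

# `sleep 1` is a stand-in for the real workload, e.g. `srun ./my_program`.
run_and_mark sleep 1
```

Because the wrapper call is the last command in the script, the job's exit code is still the program's exit code, so the accounting continues to report the failure.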
Another option, which could be seen as overkill but is easier to implement, is to submit a "monitoring" job that depends on the job after which some action must be taken, and have that monitoring job create the stopped.txt file based on the state of the first job in the accounting.
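The monitoring job could look roughly like the sketch below. The script name monitor.sh, the helper is_abnormal_state, and the list of states treated as abnormal are assumptions; adjust the list to the states your site reports. The script expects the ID of the job to check as its first argument and does nothing without one.

```shell
#!/bin/bash
# monitor.sh (hypothetical name). Submit it so that it starts once the main job ends:
#   jobid=$(sbatch --parsable job.sh)
#   sbatch --dependency=afterany:"$jobid" monitor.sh "$jobid"

# Succeed if the accounting state should lead to stopped.txt being created.
# The set of states below is an assumption covering timeout and failure.
is_abnormal_state() {
    case "$1" in
        FAILED*|TIMEOUT*|NODE_FAIL*|OUT_OF_MEMORY*) return 0 ;;
        *) return 1 ;;
    esac
}

if [ -n "${1:-}" ]; then
    # -X: allocation line only; -n: no header; -P: parsable output.
    state=$(sacct -j "$1" -X -n -P --format=State | head -n 1)
    if is_abnormal_state "$state"; then
        touch stopped.txt
    fi
fi
```

The afterany dependency makes the monitoring job run regardless of how the first job ended, which is why the state has to be read back from sacct.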