deep-learning, pytorch, cluster-computing, slurm, pytorch-lightning

Save model weights when a program hits the TIME LIMIT while training on a SLURM cluster


I use deep learning models written in pytorch_lightning (PyTorch) and train them on SLURM clusters. I submit jobs like this:

sbatch --gpus=1 -t 100 python train.py

When the requested GPU time runs out, SLURM kills my program and prints a message like this:

Epoch 0: : 339it [01:10,  4.84it/s, loss=-34]  slurmstepd: error: *** JOB 375083 ON cn-007 CANCELLED AT 2021-10-04T22:20:54 DUE TO TIME LIMIT *** 

How can I configure a Trainer to save the model when the available time runs out?

I know about automatic saving after each epoch, but I have only a single long epoch that lasts more than 10 hours, so that approach does not work for me.


Solution

  • You can use Slurm's signalling mechanism to pass a signal to your application when it is within a certain number of seconds of the time limit (see man sbatch). In your submission script, use --signal=USR1@30 to send USR1 30 seconds before the time limit is reached. Your submit script would contain these lines:

    #!/bin/bash
    #SBATCH -t 100
    #SBATCH --signal=USR1@30
    srun python train.py
    

    Then, in your code, you can handle that signal like this:

    import signal
    
    def handler(signum, frame):
        print('Signal handler got signal ', signum)
        # e.g. exit(0), or call your pytorch save routines
    
    # enable the handler
    signal.signal(signal.SIGUSR1, handler)
    
    # your code here
    

    You need to call your Python application via srun so that Slurm can propagate the signal to the Python process. (You can probably pass --signal on the sbatch command line as well; I tend to prefer writing self-contained submit scripts.)
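
    Putting the pieces together, here is a minimal, self-contained sketch of how such a handler could save a PyTorch Lightning checkpoint. The ToyModel class, the synthetic data, and the time_limit.ckpt filename are placeholders for illustration only; trainer.save_checkpoint() is Lightning's standard API for writing a checkpoint (weights plus optimizer and loop state):

    import signal

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl

    # Toy stand-in for your real LightningModule.
    class ToyModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(10, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.mse_loss(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)

    def main():
        # Synthetic data so the example runs anywhere.
        dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
        loader = DataLoader(dataset, batch_size=32)

        model = ToyModel()
        trainer = pl.Trainer(max_epochs=1)

        def handler(signum, frame):
            # Slurm sends SIGUSR1 ~30 s before the time limit (see --signal above),
            # so write a checkpoint while the job is still alive.
            print('Got signal', signum, '- saving checkpoint')
            trainer.save_checkpoint('time_limit.ckpt')

        signal.signal(signal.SIGUSR1, handler)

        trainer.fit(model, loader)

    if __name__ == '__main__':
        main()

    When you resubmit the job, you can resume from the saved file (in recent Lightning versions, e.g. via the ckpt_path argument of Trainer.fit). Recent Lightning releases also ship built-in SLURM auto-requeue support that reacts to SIGUSR1, which may cover this use case without a hand-written handler.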

    Edit: This link has a nice summary of the issues involved with signal propagation and Slurm.