amazon-web-services hpc sungridengine spot-instances

How to revive/resubmit stuck SGE jobs (usage of qsub?)


I am trying to revive or resubmit jobs running under an SGE scheduler that get stuck after a node crash or, say, after AWS spot instances are taken away. Can someone help with resuming such jobs? I have been trying to understand the usage of qsub, but I have not been able to configure anything that automatically resubmits them.

I am also unable to configure my queue with the qconf command, since only root and sge_admin users can run it. I do have root privileges, but qconf asks me to set the SGE_ROOT environment variable; I did set it, yet it keeps throwing the same error telling me to set the variable.

Any sort of assistance would be highly appreciated.


Solution

  • From the qsub man page:

      -r y[es]|n[o]
           Available for qsub and qalter only.
    
           Identifies the ability of a job to be rerun or not.  
           If the value of -r is 'yes', the job will be rerun if the job was 
           aborted without leaving a consistent  exit state.  
    
           (This is typically the case if the node on which the job is running
           crashes).  If -r is 'no', the job will not be rerun under any circumstances.
           Interactive jobs submitted with qsh, qrsh or qlogin are not rerunnable.
    
           Qalter allows changing this option even while the job executes.
    

    So adding

    #$ -r y
    

    in your job script should cater for this.
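
For illustration, here is a minimal sketch of what such a job script might look like. The job name, the `run_mode` helper and the commented-out `./my_computation` command are hypothetical placeholders, not anything from the question. The sketch also assumes your SGE version sets the `RESTARTED` environment variable (documented in the qsub man page as 1 when a job has been restarted, 0 otherwise); outside SGE it simply defaults to 0.

```shell
#!/bin/bash
# Embedded qsub options: lines starting with "#$" are read by qsub.
# Job name below is a hypothetical placeholder.
#$ -N rerunnable_job
# Run the job from the directory it was submitted from.
#$ -cwd
# Mark the job as rerunnable, so SGE reschedules it if the node dies.
#$ -r y

# Hypothetical helper: report whether this is a first run or a rerun.
# RESTARTED is assumed to be set by SGE; it defaults to 0 when unset.
run_mode() {
    if [ "${RESTARTED:-0}" = "1" ]; then
        echo "rerun"
    else
        echo "first"
    fi
}

echo "Job starting in mode: $(run_mode)"

# A rerun starts the script again from the top, so the real workload
# (placeholder below) would need its own checkpoint logic to avoid
# repeating finished work.
# ./my_computation
```

You would submit this as usual with `qsub job.sh`. Since the man page says `-r` is also available for qalter, a job that is already queued or running can be marked rerunnable with `qalter -r y <job_id>`.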