slurm

SLURM interactive job is assigned to a worker node but effectively runs on the login node


When I start an interactive job, it is allocated to one of the worker nodes. I can see that in the terminal output and in squeue. But when I then run commands in the interactive shell, they use the RAM/CPUs of the login node; I checked this with both htop and glances.

salloc -c 2 -t 10:00:00 --mem=4G
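
For reference, this is how I can see the mismatch from inside the salloc shell (hostname, SLURM_JOB_NODELIST and srun are standard; node1/node2 are just my nodes):

hostname                    # prints the login node's name, not node1/node2
echo $SLURM_JOB_NODELIST    # shows the node that was actually allocated
srun hostname               # runs inside the allocation and prints node1/node2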

What should I check in the SLURM configuration file? I would not expect this to happen.

The "login" node is not in the configuration file.

### Nodes
NodeName=node1 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=480000
NodeName=node2 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=480000

### Partitions
PartitionName=standard Nodes=node1,node2 Default=Yes MaxCPUsPerNode=76 MaxTime=INFINITE State=UP

When I submit with sbatch, everything works fine.


Solution

  • Whether salloc connects you to the (first) node of the allocation depends on the use_interactive_step option of LaunchParameters in slurm.conf.

    If set, it will:

    Have salloc use the Interactive Step to launch a shell on an allocated compute node rather than locally to wherever salloc was invoked.

    Otherwise, you get the behaviour you observe.
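
    As a minimal sketch (assuming a SLURM version that supports this option), the fix is to add it to LaunchParameters in slurm.conf, appending it comma-separated if other launch options are already listed:

    ### Launch parameters
    LaunchParameters=use_interactive_step

    After distributing the updated slurm.conf to all nodes, run scontrol reconfigure (or restart the daemons) and check the active value with scontrol show config | grep LaunchParameters; a new salloc session should then land on the allocated compute node.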