Tags: r, slurm, hpc, stan

Slurm Job is Running out of Memory [RAM?] but memory limit not reached


I run simulations on an HPC cluster that are quite memory demanding. I'm fitting cmdstan models with 3000 iterations for different conditions (200 unique combinations). To do this, I'm using the SimDesign package in R.

The simulations run perfectly fine, with outputs as expected, when I run them with a low number of replications (e.g. 10). For testing, I now wanted to run one condition row with 100 reps (this will be the real case). But after approx. 1 hour, my node runs out of space:

sh: /rds2874z4733/temp/ no space left on device
sh: /rds2874z4733/temp/ no space left on device

When I monitor the job after cancelling it, I see that the allocated memory is not yet depleted (even though it would not have been a sufficient amount of memory in the end):

State: CANCELLED (exit code 0)
Nodes: 1
Cores per node: 64
CPU Utilized: 3-06:35:52
CPU Efficiency: 93.73% of 3-11:51:28 core-walltime
Job Wall-clock time: 01:18:37
Memory Utilized: 33.82 GB
Memory Efficiency: 39.13% of 86.43 GB

I also tried to allocate more memory and a larger ramdisk for my node, but this does not solve the problem. As I will fit 100 cmdstan models per condition, I also tried to free memory within the fitting function, like this:

.....
  # Stan is noisy, so tell it to be quiet()
  M3 <- quiet(mod$sample(dat,
                         refresh = 0,
                         chains = 4,
                         #parallel_chains = 4,
                         iter_warmup = n_warmup,
                         iter_sampling = n_iter,
                         adapt_delta = adapt_delta,
                         max_treedepth = max_treedepth,
                         init = init,
                         show_messages = FALSE))

  # Keep only the summaries, not the full fit object
  M3_hyper <- M3$summary(c("hyper_pars", "mu_f"), mean, Mode, sd, rhat, HDInterval::hdi)
  M3_subj  <- M3$summary(c("subj_pars"), mean, sd, rhat, Mode, HDInterval::hdi)
  M3_f     <- M3$summary(c("f"), mean, sd, Mode, rhat, HDInterval::hdi)
  M3_count_rep <- M3$summary(c("count_rep"), mean)
  M3_omega     <- M3$summary("cor_mat_lower_tri", mean)

  M3_sum <- list(M3_hyper, M3_subj, M3_f, M3_count_rep, M3_omega)

  # Drop the fit object and force garbage collection to free memory
  rm(M3)
  gc(full = TRUE)

  return(M3_sum)

But this does not solve the problem. On every replication, this data is saved, and once the number of replications is reached, it is summarised. This runs in parallel, as the package takes care of that. I do not save the per-replication results, only the summarised results at the end of the simulation. As I will simulate 200 conditions with 100 reps each, I need to solve this issue either way. I will definitely run only 1 or 2 conditions per node, so it will still be at least 2500 models for each node....
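
For reference, the overall structure of the simulation is roughly the following (a minimal sketch: Generate, Summarise and fit_M3 stand in for my actual functions, and the design values are just placeholders):

library(SimDesign)

## Placeholder design -- the real one has 200 unique condition combinations
Design <- createDesign(N = c(20, 40),
                       K = c(4, 8),
                       F = c(0.3, 0.6))

## Each replication fits one cmdstan model and returns only the summaries
## (fit_M3 is a stand-in name for the fitting function shown above)
Analyse <- function(condition, dat, fixed_objects = NULL) {
  fit_M3(dat, condition)
}

res <- runSimulation(design       = Design,
                     replications = 100,        # 100 reps per condition in the real run
                     generate     = Generate,   # data-generating function (not shown)
                     analyse      = Analyse,
                     summarise    = Summarise,  # aggregates over the replications (not shown)
                     parallel     = TRUE)       # SimDesign parallelises the replications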

Does anybody have experience with the SimDesign package or Slurm memory allocation and can give me some advice? I'm relatively new to coding on a cluster, so I appreciate any advice!

Cheers,

jan

Here is the job script used for each condition job:

#!/bin/bash

#SBATCH -A  acc          # Account
#SBATCH -p parallel      # Partition: parallel, smp, bigmem
#SBATCH -C skylake       # architecture Skylake (64 Cores) or Broadwell (40 Cores)  
#SBATCH -n 1                     # number of tasks
#SBATCH -N 1             # allocate one full node   
#SBATCH --ramdisk=100G       # Reserve sufficient space for job on ramdisk  
#SBATCH -t 02:30:00              # Run time (hh:mm:ss)


## Default Output 
WD="/prjtdir/M3-simulations/"

## Move job to Ramdisk for sufficient space
JOBDIR="/localscratch/${SLURM_JOB_ID}/"
RAMDISK=$JOBDIR/ramdisk

module purge # ensures vanilla environment
module load lang/R # will load most current version of R

cp $WD/sim3.R $RAMDISK
cp -R $WD/Functions $RAMDISK
cp -R $WD/Models $RAMDISK

## Change Dir to Jobfolder
cd $RAMDISK

# Run Script
srun Rscript sim3.R -N $1 -K $2 -F $3 -R $4 -P $5 -I ${SLURM_JOB_ID} -D ${WD}

And here is an excerpt of sinfo - I usually use the parallel partition with 64 cores per node:

sinfo -Nel -p parallel
Sun Aug 07 01:23:29 2022
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
x0001          1  parallel    drained* 64     2:16:2  88500        0      6 anyarch, RBH_OPAFM
x0002          1  parallel     drained 64     2:16:2  88500        0      6 anyarch, RBH_OPAFM
x0003          1  parallel   allocated 64     2:16:2  88500        0      6 anyarch, none
x0004          1  parallel     drained 64     2:16:2  88500        0      6 anyarch, SlurmdSpoolDir is fu
x0005          1  parallel   allocated 64     2:16:2  88500        0      6 anyarch, none
x0006          1  parallel   allocated 64     2:16:2  88500        0      6 anyarch, none
x0007          1  parallel   allocated 64     2:16:2  88500        0      6 anyarch, none
x0008          1  parallel   allocated 64     2:16:2  88500        0      6 anyarch, none
x0009          1  parallel   allocated 64     2:16:2  88500        0      6 anyarch, none
x0010          1  parallel   allocated 64     2:16:2  88500        0      6 anyarch, none

Here is the actual error I get after approx. 40-60 minutes (depending on the condition):

Design row: 1/1;   Started: Sun Aug  7 00:44:31 2022;   Total elapsed time: 0.00s 
sh: /tmp/RtmpyFfzBI/file9eed87b8b8e1c: No space left on device
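
The path /tmp/Rtmp... suggests that it is R's per-session temporary directory that fills up, not the allocated RAM: R creates this directory under TMPDIR (falling back to /tmp), and /tmp on the compute nodes is small. A quick way to check this from within the job's R session (paths will of course differ on other systems):

## Where does R put its temporary files, and how much space is left there?
Sys.getenv(c("TMPDIR", "TMP", "TEMP"))    # variables R consults at session start
tempdir()                                 # per-session temp dir, e.g. /tmp/RtmpXXXXXX
system(paste("df -h", tempdir()))         # free space on the filesystem holding it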

Solution

  • I was able to fix the problem by defining TMPDIR on the scratch space as well:

    ## Move job to the local scratch space and point TMPDIR there as well
    JOBDIR="/localscratch/${SLURM_JOB_ID}/"
    export TMPDIR=$JOBDIR   # export so that R (started via srun) picks it up
    
    module purge # ensures vanilla environment
    module load lang/R # will load most current version of R
    
    cp $WD/sim3.R $JOBDIR
    cp -R $WD/Functions $JOBDIR
    cp -R $WD/Models $JOBDIR
    
    ## Change Dir to Jobfolder
    cd $JOBDIR
    
    # Run Script
    srun Rscript sim3.R -N $1 -K $2 -F $3 -R $4 -P $5 -I ${SLURM_JOB_ID} -D ${WD}
    

    With this change, the job runs with double the iterations in every condition without needing additional space.
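
    An additional option at the R level would presumably be to point cmdstanr's output files directly at the job directory, since $sample() by default writes each chain's draws as CSV files into R's temporary directory. A minimal sketch (the JOBDIR environment variable here is hypothetical and would have to be exported from the job script):

    ## Redirect cmdstan's per-chain CSV files away from tempdir()
    scratch <- Sys.getenv("JOBDIR")          # hypothetical variable exported by the job script
    M3 <- quiet(mod$sample(dat,
                           refresh       = 0,
                           chains        = 4,
                           iter_warmup   = n_warmup,
                           iter_sampling = n_iter,
                           output_dir    = scratch,   # cmdstanr writes the draw CSVs here
                           show_messages = FALSE))

    That way the CSV files land on the large scratch space regardless of what TMPDIR happens to be set to.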