amazon-web-services slurm amazon-parallelcluster

How to stop a compute node with SLURM?


I am using SLURM on AWS to manage jobs as part of AWS ParallelCluster. I have two questions:


Solution

  • Check out this page in the docs: https://docs.aws.amazon.com/parallelcluster/latest/ug/autoscaling.html

    In short, any instance that has had no jobs for longer than scaledown_idletime (the default is 10 minutes) gets scaled down (terminated) by the cluster automatically.

    If 10 minutes is too long, you can change the setting in the config file when you build your cluster. Think about your workload first, though: with too short a value, small gaps between jobs cause a lot of churn, because nodes get terminated and then created again shortly afterwards. That is the reason for the 10-minute default.
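
    As a rough sketch of what that change might look like, assuming the ParallelCluster 2.x INI-style config file: the section label "custom" and the value of 5 minutes are illustrative choices, not recommendations, and the "[cluster default]" section name assumes the default cluster template.

    ```ini
    [cluster default]
    # Point the cluster at the scaling section defined below
    # ("custom" is an arbitrary, hypothetical label).
    scaling_settings = custom

    [scaling custom]
    # Minutes a compute node may sit without jobs before it is terminated.
    # The default is 10; 5 is only an example value.
    scaledown_idletime = 5
    ```

    If you are on ParallelCluster 3.x, the config file is YAML instead and, as far as I know, the equivalent setting is ScaledownIdletime under Scheduling / SlurmSettings, so check the docs for your version before copying this.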