When running a Python script via slurm srun --pty bash, I get a cryptic error message: Bus error: core dumped.
I searched the slurm documentation and it doesn't mention this error type.
What's going on and how can I fix it?
I found this general information on the bus error (What is a bus error? Is it different from a segmentation fault?), but that doesn't explain how and why it happens in a SLURM environment or what can be done to avoid it.
In at least one case, this was probably because my job needed more memory than it had requested and was therefore killed by SLURM.
I had submitted the job with 32 GB of memory and the core dump was 33 GB, so I'm fairly sure that was the cause here.
Helpful answer from Ben Evans on the Yale cluster Discourse that may apply more generally to other clusters:
On the Yale clusters, a bus error usually means your job ran out of memory (RAM). If you cannot reduce the memory usage of your code, you can request additional memory for your job using the --mem-per-cpu or --mem Slurm flags.
More details: Your program can run into this fault because of the way we manage memory with cgroups so that many jobs can be run on the same physical machine without interfering with one another. If a process inside a job tries to access memory “outside” what was allocated to that job, e.g. more than what you requested, the operating system tells your program that address is invalid with the fault Bus Error, aka SIGBUS, exit(10). A similar fault you might be more familiar with is a Segmentation Fault, aka SIGSEGV, exit(11) which usually results from a program incorrectly trying to access a valid memory address.
https://ask.cyberinfrastructure.org/t/what-does-it-mean-when-i-get-a-bus-error-in-my-job/1101/2
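To judge how much to pass to --mem, it can help to have the script report its own peak memory use next to what the job was given. The following is a minimal sketch, not a Slurm-provided tool: the helper names (peak_rss_mb, allocated_mb, memory_report) are made up for illustration, and it assumes the job was submitted with --mem so that Slurm sets SLURM_MEM_PER_NODE (in MB) in the job environment. With --mem-per-cpu that variable may be absent, in which case the report just prints the peak.

```python
import os
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def allocated_mb(env=os.environ):
    """Memory granted to the job in MB, or None if it cannot be determined.

    Assumption: SLURM_MEM_PER_NODE is only set when the job used --mem.
    """
    value = env.get("SLURM_MEM_PER_NODE")
    return int(value) if value and value.isdigit() else None


def memory_report(env=os.environ):
    """One-line summary of peak usage versus the Slurm allocation."""
    used = peak_rss_mb()
    limit = allocated_mb(env)
    if limit is None:
        return f"peak RSS {used:.0f} MB (no Slurm memory limit found)"
    return f"peak RSS {used:.0f} MB of {limit} MB allocated ({100 * used / limit:.0f}%)"


if __name__ == "__main__":
    # Call this at the end of the real script, or at checkpoints.
    print(memory_report())
```

If the reported peak is close to the allocation on a smaller test input, rerunning with a larger --mem or --mem-per-cpu value is the likely fix. Note that this only measures the calling process: memory used by child processes is not included, so the job's total can be higher than what is printed.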