python  parallel-processing  slurm  joblib  supercomputers

What is causing my random "joblib.externals.loky.process_executor.TerminatedWorkerError" errors?


I'm doing GIS-based data analysis in which I calculate wide-area, nationwide prediction maps (e.g. weather maps). Because my target area is very large (a whole country), I use supercomputers (Slurm) and parallelization to calculate the prediction maps. That is, I split the prediction map into multiple pieces, each piece being calculated in its own process (embarrassingly parallel processes), and within each process multiple CPU cores are used to calculate that piece (the map piece is further split into smaller pieces for the CPU cores).
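
To illustrate, within each process the assigned map piece is computed roughly like this (a simplified sketch with placeholder names, not my actual code):

from joblib import Parallel, delayed

def calculate_sub_raster(sub_piece):
    # placeholder for the heavy raster computation done for one sub-piece
    return sub_piece

# placeholder: the sub-pieces of the one map piece assigned to this process
sub_pieces = list(range(16))

# n_jobs=-1 -> one loky worker process per CPU core visible on this node;
# pre_dispatch='2*n_jobs' -> queue twice that many tasks up front
sub_rasters = Parallel(n_jobs=-1, verbose=0, pre_dispatch='2*n_jobs')(
    delayed(calculate_sub_raster)(sub_piece) for sub_piece in sub_pieces
)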

I use Python's joblib library to take advantage of the multiple cores at my disposal (as in the sketch above), and most of the time everything works smoothly. But sometimes, randomly, in about 1.5% of runs, I get the following error:

Traceback (most recent call last):
  File "main.py", line 557, in <module>
    sub_rasters = Parallel(n_jobs=-1, verbose=0, pre_dispatch='2*n_jobs')(
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/site-packages/joblib/parallel.py", line 1054, in __call__
    self.retrieve()
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/site-packages/joblib/parallel.py", line 933, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/root_path/conda/envs/geoconda-2021/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGBUS(-7)}

What causes this problem, any ideas? And how can I make sure this does not happen? It is irritating because, for example, if I have 200 map pieces being calculated and 197 succeed while 3 fail with this error, then I have to calculate those 3 pieces again.
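
To illustrate the kind of detect-and-rerun bookkeeping I mean (a simplified sketch with placeholder names, not my actual code; in reality every piece runs as its own Slurm task):

import json
from joblib import Parallel, delayed
from joblib.externals.loky.process_executor import TerminatedWorkerError

def calculate_sub_raster(sub_piece):
    # placeholder for the real per-sub-piece computation
    return sub_piece

def calculate_piece(piece_id, n_sub_pieces=16):
    # compute one map piece; return None if a worker process got killed
    try:
        return Parallel(n_jobs=-1, pre_dispatch='2*n_jobs')(
            delayed(calculate_sub_raster)(s) for s in range(n_sub_pieces)
        )
    except TerminatedWorkerError:
        # the whole Parallel() call is lost once any worker dies,
        # so the complete piece has to be scheduled again
        return None

if __name__ == "__main__":
    failed = [pid for pid in range(200) if calculate_piece(pid) is None]
    with open("failed_pieces.json", "w") as fh:
        json.dump(failed, fh)  # the pieces that must be recalculated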


Solution

  • Q :
    " What causes this problem, any ideas? - I am using supercomputers "

    A :
    a)
the Python Interpreter process ( even if run on a supercomputer ) lives in the actual localhost RAM-memory of a single node.
    b)
    given (a), the number of such localhost CPU-cores controls the joblib.Parallel() behaviour.
    c)
given (b), having set n_jobs = -1 makes such Python Interpreter request as many loky-backend specific separate process instantiations as there are localhost CPU-cores ( could be anywhere from 4, 8, 16, ..., 80, ... 8192 - yes, it depends on the actual "supercomputer" hardware / SDS composition ), while pre_dispatch = '2*n_jobs' additionally queues twice that many tasks, together with their input data, up front ( a quick way to inspect these localhost numbers is sketched below, right after the documented trace )
    d)
given (c), each such new Python Interpreter process ( one per worker process demanded to be launched in (c), so anywhere from a few to many thousands ) requests a new, separate RAM-allocation from the localhost O/S memory manager
    e)
given (d), such accumulating RAM-allocations ( each Python process may ask for anything between 30 MB and 3000 MB of RAM, depending on the actual joblib-backend used and on the richness of the internal state of the __main__, joblib.Parallel()-launching, Python Interpreter ) may easily and soon grow beyond the physical RAM, at which point swap starts to emulate the missing capacity by exchanging blocks of RAM content between physical RAM and disk storage - at latencies some 10,000x to 100,000x higher than if the work had not been forced into such swap-based, virtual-memory emulation of the missing physical-RAM resources
    f)
    given (e) "supercomputing" administration often prohibits over-allocations by administrative tools and kills all processes, that tried to oversubscribe RAM-resources beyond some fair-use threshold or user-profiled quota
g)
    given (f), and w.r.t. the documented trace:

    ...
    joblib.externals.loky.process_executor.TerminatedWorkerError:
           A worker process managed by the executor
           was unexpectedly terminated. This could be caused
               by a segmentation fault while calling the function
             or
               by an excessive memory usage
           causing the Operating System to kill the worker.
    

the above chain of evidence points to either a SegFAULT ( not very probable in plain Python Interpreter realms ) or a deliberate KILL due to "supercomputer" Fair Usage Policy violation(s), here most likely due to excessive memory usage.
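
    A minimal, Linux-only sketch of how to check the localhost figures that steps (b) to (d) operate on, before a single worker gets launched:

    import os
    import joblib

    # the number of loky worker processes that n_jobs = -1 translates into here
    n_workers = joblib.cpu_count()

    # CPU-cores actually granted to this Slurm job step ( may be fewer than the node owns )
    granted_cores = len(os.sched_getaffinity(0))

    # physical RAM of the localhost node, shared by all the worker processes
    ram_gib = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 2**30

    print(f"n_jobs = -1 would spawn {n_workers} loky workers "
          f"( {granted_cores} CPU-cores granted by Slurm ), "
          f"all sharing {ram_gib:.1f} GiB of physical RAM")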

    For SIGBUS(-7) you may defensively try to avoid Lustre flushing and review the details of any mmap-usage that might be reading "beyond EoF", if applicable:

    By default, Slurm flushes Lustre file system and kernel caches upon completion of each job step. If multiple applications are run simultaneously on compute nodes (either multiple applications from a single Slurm job or multiple jobs) the result can be significant performance degradation and even bus errors. Failures occur more frequently when more applications are executed at the same time on individual compute nodes. Failures are also more common when Lustre file systems are used.

    Two approaches exist to address this issue. One is to disable the flushing of caches, which can be accomplished by adding "LaunchParameters=lustre_no_flush" to your Slurm configuration file "slurm.conf".

    Consult the Fair Usage Policies applicable to your "supercomputer" with its Technical Support Dept. so as to get the valid ceiling details.

    Next, refactor your code so that it does not spawn and pre_dispatch that many worker processes and tasks, if you still wish to keep the strategy of single-node process-replication instead of another, less RAM-blocking, more efficient HPC computing strategy, e.g. as sketched below.
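
    As a sketch only - the per-worker peak-RAM figure has to be measured on your side and the actual ceiling confirmed by the Technical Support Dept. - the refactoring may look like this ( SLURM_MEM_PER_NODE is exported by Slurm only when the job requested memory explicitly; all figures below are placeholders ):

    import os
    import joblib
    from joblib import Parallel, delayed

    def calculate_sub_raster(sub_piece):
        # placeholder for the real per-sub-piece computation
        return sub_piece

    # placeholder: measured peak-RAM need of one worker process, in GiB
    RAM_PER_WORKER_GIB = 2.0

    # RAM granted to this job step ( SLURM_MEM_PER_NODE is in MB; 16 GiB fallback )
    ram_granted_gib = int(os.environ.get("SLURM_MEM_PER_NODE", 16 * 1024)) / 1024

    # never ask for more workers than either the CPU-cores or the granted RAM allow
    n_safe_jobs = max(1, min(joblib.cpu_count(),
                             int(ram_granted_gib // RAM_PER_WORKER_GIB)))

    sub_rasters = Parallel(n_jobs=n_safe_jobs,        # explicit, RAM-aware count
                           pre_dispatch='n_jobs',     # no 2x task queue up front
                           batch_size=1)(
        delayed(calculate_sub_raster)(sub_piece) for sub_piece in range(64)
    )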