Tags: python, jupyter-notebook, multiprocessing, pyzmq, papermill

zmq.error.ZMQError: Address already in use, when running multiprocessing with multiple notebooks using papermill


I am using the papermill library to run multiple notebooks using multiprocessing simultaneously.

This is occurring on Python 3.6.6, Red Hat 4.8.2-15 within a Docker container.

However, when I run the Python script, about 5% of my notebooks fail immediately (no notebook cells run) because I receive this error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-124>", line 2, in initialize
  File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 469, in initialize
    self.init_sockets()
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 238, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 180, in _bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use

along with:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "main.py", line 77, in run_papermill
    pm.execute_notebook(notebook, output_path, parameters=config)
  File "/opt/conda/lib/python3.6/site-packages/papermill/execute.py", line 104, in execute_notebook
    **engine_kwargs
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 49, in execute_notebook_with_engine
    return self.get_engine(engine_name).execute_notebook(nb, kernel_name, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 304, in execute_notebook
    nb = cls.execute_managed_notebook(nb_man, kernel_name, log_output=log_output, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 372, in execute_managed_notebook
    preprocessor.preprocess(nb_man, safe_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/preprocess.py", line 20, in preprocess
    with self.setup_preprocessor(nb_man.nb, resources, km=km):
  File "/opt/conda/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 345, in setup_preprocessor
    self.km, self.kc = self.start_new_kernel(**kwargs)
  File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 296, in start_new_kernel
    kc.wait_for_ready(timeout=self.startup_timeout)
  File "/opt/conda/lib/python3.6/site-packages/jupyter_client/blocking/client.py", line 104, in wait_for_ready
    raise RuntimeError('Kernel died before replying to kernel_info')
RuntimeError: Kernel died before replying to kernel_info

Please help me with this problem; I have scoured the web trying different solutions, none of which have worked for my case so far.

This 5% error rate occurs regardless of the number of notebooks I run simultaneously or the number of cores on my machine, which makes it extra curious.

I have tried changing the multiprocessing start method and updating the libraries, but to no avail.

The versions of my libraries are:

papermill==1.2.1
ipython==7.14.0
jupyter-client==6.1.3

Thank you!


Solution

  • The problem clearly points to ZeroMQ being unable to .bind() successfully.

    The first error message, zmq.error.ZMQError: Address already in use, is the easier one to explain. While any number of ZeroMQ sockets may freely .connect() to a given endpoint, one and only one may .bind() to a particular transport-class address.
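    This exclusive ownership is enforced by the operating system, not only by ZeroMQ, so the same failure mode can be reproduced with plain standard-library TCP sockets (a minimal sketch, independent of pyzmq):

```python
import errno
import socket

# The first socket acquires exclusive ownership of a local address.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
addr = first.getsockname()        # the (host, port) the OS assigned

# A second socket attempting to bind the very same address must fail,
# exactly like the second ZeroMQ .bind() on an occupied endpoint.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(addr)
except OSError as exc:
    print(exc.errno == errno.EADDRINUSE)   # True: "Address already in use"
finally:
    second.close()
    first.close()
```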

    There are three potential reasons for this happening:

    1) accidentally calling some code (without knowing its internals) via { multiprocessing.Process | joblib.Parallel | Docker-wrapped | ... }-spawned replicas, each of which tries to acquire ownership of the same ZeroMQ transport-class address; every attempt after the first successful one must fail.

    2) a rather fatal situation, where some previously run process did not release that transport-class address for further use (remember that ZeroMQ may be just one of several interested candidates: a configuration-management flaw), or where a previous run failed to terminate gracefully and left a Context()-instance still listening, in some cases until an O/S reboot, for something that will never happen.

    3) genuinely bad engineering practice in the module's design: the ZeroMQ API documents the EADDRINUSE error, and it could be handled less brutally than by crashing the whole circus (at all the associated costs).


    The other error message, RuntimeError: Kernel died before replying to kernel_info, relates to a state in which the notebook's kernel took so long to establish all the internal connections with its own components (pool peers) that it exceeded a configured or hard-coded timeout; the kernel process simply stopped waiting and raised the otherwise unhandled exception you observed and reported.
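    If the kernel is merely slow to come up (for example when many kernels start at once inside one container), papermill's execute_notebook accepts a start_timeout keyword (kernel start-up timeout in seconds, default 60), and raising it is a cheap first mitigation. A sketch that keeps those kwargs in one place; the helper name and the 120-second value are my own illustrative choices:

```python
def papermill_kwargs(config, start_timeout=120):
    """Kwargs for papermill.execute_notebook with a longer kernel timeout.

    start_timeout is papermill's kernel start-up timeout in seconds
    (its default is 60); 120 here is an illustrative choice.
    """
    return {"parameters": config, "start_timeout": start_timeout}

# Usage inside the multiprocessing target (cf. run_papermill in the traceback):
#   import papermill as pm
#   pm.execute_notebook(notebook, output_path, **papermill_kwargs(config))
```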

    Solution

    First, check for any hanging address owners; reboot all nodes if in doubt. Next, verify that there are no colliding attempts hidden inside your own code or its { multiprocessing.Process() | joblib.Parallel() | ... }-calls which, once distributed, may try to .bind() to the same target. If none of these steps resolves the trouble within your domain of control, ask the support channels of the modules you use to help you analyze, refactor, and validate the still-colliding use case.
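    Until the root cause is found, a pragmatic mitigation (my own suggestion, not a facility of the papermill API) is to retry a failed notebook a couple of times: the bind-collision window is narrow, and a re-run normally obtains a fresh set of ports. A generic sketch; execute is any zero-argument callable, e.g. functools.partial(pm.execute_notebook, notebook, output_path, parameters=config):

```python
import time

def retry(execute, attempts=3, delay=5.0, exceptions=(RuntimeError,)):
    """Call execute(); on one of the listed exceptions, wait and try again.

    attempts, delay and the caught exception types are illustrative
    defaults, not values mandated by papermill or ZeroMQ.
    """
    for attempt in range(1, attempts + 1):
        try:
            return execute()
        except exceptions:
            if attempt == attempts:
                raise                  # out of attempts: re-raise the last error
            time.sleep(delay)          # let the dying kernel release its ports
```

    Wrapping the pm.execute_notebook call inside the multiprocessing target in retry() turns the sporadic 5% failures into extra attempts instead of lost notebooks.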