I am using the papermill library with multiprocessing to run multiple notebooks simultaneously.
This is running on Python 3.6.6, Red Hat 4.8.2-15, inside a Docker container.
However, when I run the Python script, about 5% of my notebooks fail immediately (no Jupyter Notebook cells run) because I receive this error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
app.initialize(argv)
File "<decorator-gen-124>", line 2, in initialize
File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
return method(app, *args, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 469, in initialize
self.init_sockets()
File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 238, in init_sockets
self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 180, in _bind_socket
s.bind("tcp://%s:%i" % (self.ip, port))
File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use
along with:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "main.py", line 77, in run_papermill
pm.execute_notebook(notebook, output_path, parameters=config)
File "/opt/conda/lib/python3.6/site-packages/papermill/execute.py", line 104, in execute_notebook
**engine_kwargs
File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 49, in execute_notebook_with_engine
return self.get_engine(engine_name).execute_notebook(nb, kernel_name, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 304, in execute_notebook
nb = cls.execute_managed_notebook(nb_man, kernel_name, log_output=log_output, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 372, in execute_managed_notebook
preprocessor.preprocess(nb_man, safe_kwargs)
File "/opt/conda/lib/python3.6/site-packages/papermill/preprocess.py", line 20, in preprocess
with self.setup_preprocessor(nb_man.nb, resources, km=km):
File "/opt/conda/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 345, in setup_preprocessor
self.km, self.kc = self.start_new_kernel(**kwargs)
File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 296, in start_new_kernel
kc.wait_for_ready(timeout=self.startup_timeout)
File "/opt/conda/lib/python3.6/site-packages/jupyter_client/blocking/client.py", line 104, in wait_for_ready
raise RuntimeError('Kernel died before replying to kernel_info')
RuntimeError: Kernel died before replying to kernel_info
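For reference, the launcher is essentially the following (a simplified sketch reconstructed from the traceback above; the notebook paths and parameters are placeholders):

import multiprocessing

import papermill as pm

def run_papermill(notebook, output_path, config):
    # corresponds to main.py line 77 in the traceback above
    pm.execute_notebook(notebook, output_path, parameters=config)

if __name__ == "__main__":
    tasks = [  # placeholder job list
        ("nb_a.ipynb", "out_a.ipynb", {"alpha": 1}),
        ("nb_b.ipynb", "out_b.ipynb", {"alpha": 2}),
    ]
    procs = [
        multiprocessing.Process(target=run_papermill, args=t) for t in tasks
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()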
Please help me with this problem; I have scoured the web trying different solutions, but none have worked for my case so far.
Curiously, this 5% error rate holds regardless of the number of notebooks I run simultaneously or the number of cores on my machine.
I have tried changing the multiprocessing start method and updating the libraries, but to no avail.
The versions of my libraries are:
papermill==1.2.1
ipython==7.14.0
jupyter-client==6.1.3
Thank you!
The clear problem attribution points towards ZeroMQ being unable to successfully .bind().
The error message zmq.error.ZMQError: Address already in use is the easier one to explain: whereas ZeroMQ access points can, for obvious reasons, freely try to .connect() to many counterparts, one and only one can .bind() onto a particular Transport Class address target.
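A minimal standalone pyzmq sketch of that asymmetry (the address tcp://127.0.0.1:5555 is an arbitrary example choice):

import zmq

ctx = zmq.Context()

owner = ctx.socket(zmq.REP)
owner.bind("tcp://127.0.0.1:5555")      # the first bind acquires the address

peer_a = ctx.socket(zmq.REQ)
peer_a.connect("tcp://127.0.0.1:5555")  # any number of connects is fine
peer_b = ctx.socket(zmq.REQ)
peer_b.connect("tcp://127.0.0.1:5555")

rival = ctx.socket(zmq.REP)
try:
    rival.bind("tcp://127.0.0.1:5555")  # a second bind onto the same address
except zmq.error.ZMQError as e:
    print(e)                            # -> Address already in use
finally:
    for s in (owner, peer_a, peer_b, rival):
        s.close(linger=0)
    ctx.term()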
There are three potential reasons for this happening:
1 ) accidentally calling some code (without knowing its internal details) from { multiprocessing.Process | joblib.Parallel | Docker-wrapped | ... }-spawned replicas, each of which tries to acquire ownership of the same ZeroMQ Transport Class address; for obvious reasons, every attempt after the first successful one will fail.
2 ) a rather fatal situation, where some previously run process did not manage to release such a Transport-Class-specific address for further use (do not forget that ZeroMQ may be just one of several interested candidates: a Configuration Management flaw), or where previous runs failed to gracefully terminate their resource usage and left a Context() instance still waiting (in some cases infinitely, until an O/S reboot), listening for something that will never happen.
3 ) an indeed bad engineering practice in the module's software design: not handling the EADDRINUSE error / exception documented by the ZeroMQ API any less brutally than by just crashing the whole circus (at all the associated costs of that); see the sketch after this list.
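As an illustration of item 3, a hedged sketch of a less brutal reaction to EADDRINUSE: retry the next port instead of crashing (bind_with_retry is an illustrative helper, not an ipykernel API; it mirrors the s.bind("tcp://%s:%i" ...) call seen in the traceback):

import errno

import zmq

def bind_with_retry(socket, ip, start_port, max_retries=5):
    """Try successive ports instead of crashing on EADDRINUSE (sketch)."""
    port = start_port
    for _ in range(max_retries):
        try:
            socket.bind("tcp://%s:%i" % (ip, port))
            return port                 # bound successfully
        except zmq.error.ZMQError as e:
            if e.errno != errno.EADDRINUSE:
                raise                   # only the documented collision is retried
            port += 1                   # address taken, try the next one
    raise RuntimeError("no free port in %d attempts starting at %d"
                       % (max_retries, start_port))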
The other error message, RuntimeError: Kernel died before replying to kernel_info, relates to a state where the notebook's kernel was trying for so long to establish all the internal connections with its own components (pool peers) that the wait exceeded a configured or hardcoded timeout; the kernel process simply stopped waiting and threw itself into the otherwise unhandled exception you observed and reported.
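If the kernel is merely slow to wire itself up under load, one mitigation is to lengthen that wait: papermill's execute_notebook accepts a start_timeout argument (seconds, 60 by default) for exactly this kernel-startup wait, though raising it will not help if the port is already owned by someone else. A sketch with placeholder paths and parameters:

import papermill as pm

pm.execute_notebook(
    "input.ipynb",              # placeholder paths
    "output.ipynb",
    parameters={"alpha": 1},    # placeholder parameters
    start_timeout=180,          # wait up to 3 minutes for the kernel_info reply
)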
Check for any hanging address owners first, and reboot all nodes if in doubt. Next, verify there are no colliding attempts "hidden" inside your own code or its { multiprocessing.Process() | joblib.Parallel() | ... } calls which, once distributed, may try to .bind() onto the same target. If none of these steps salvages the trouble within your domain of control, ask the support channels of the modules you use to analyze and help you refactor and validate your still-colliding use case.
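Until the collision itself is removed, a pragmatic guard inside your own domain of control is to retry the whole notebook execution when the kernel dies at startup; since the reported failure mode is transient (about 5% of launches), a second attempt usually lands on free ports. A hedged sketch (run_papermill_with_retry is an illustrative wrapper, not a papermill API):

import papermill as pm

def run_papermill_with_retry(notebook, output_path, config, attempts=3):
    """Retry transient kernel-startup failures (illustrative sketch)."""
    last_error = None
    for _ in range(attempts):
        try:
            return pm.execute_notebook(notebook, output_path, parameters=config)
        except RuntimeError as e:
            # only the startup race is worth retrying; re-raise anything else
            if "Kernel died" not in str(e):
                raise
            last_error = e
    raise last_error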