python-3.x docker aws-batch

How do you run multiple AWS Batch jobs that use multiprocessing's mp.Manager() without hitting the "Address already in use" port conflict error?


How do you fix conflicting port allocation on AWS Batch when using multiprocessing? I am running multiple Batch containers, each of which uses multiprocessing. When two Batch jobs run at the same time, they fail with "Address already in use". This SO question describes the same problem: Docker container with Selenium and Chrome webdriver crashes when multiple containers run in parallel on AWS Batch.

This is the problem:

AWS Batch communicates with Compute Resources via the ECS Agent which is instructed to start jobs with NetworkMode set to "host" as you have already determined. Currently the service is not designed to run Jobs that are listening for external network requests into the container instance.

Code:

    import multiprocessing as mp
    ...
    proc = []
    mgr = mp.Manager()
    mgr_queue = mgr.Queue()
    p = mp.Process(target=func, args=(x, y, mgr_queue))
    p.start()
    ...

Error:

    Process SyncManager-1:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 583, in _run_server
        server = cls._Server(registry, address, authkey, serializer)
      File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 156, in __init__
        self.listener = Listener(address=address, backlog=16)
      File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 453, in __init__
        self._listener = SocketListener(address, family, backlog)
      File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 596, in __init__
        self._socket.bind(address)
    OSError: [Errno 98] Address already in use

Based on the documentation for Manager(), which returns a SyncManager (a subclass of BaseManager):

class multiprocessing.managers.BaseManager([address[, authkey]])
Create a BaseManager object.

Once created one should call start() or get_server().serve_forever() to ensure that the manager object refers to a started manager process.

address is the address on which the manager process listens for new connections. If address is None then an arbitrary one is chosen.

Since I do not pass an address to mp.Manager(), I expected it to pick an arbitrary one, but that does not appear to be the case. How do you fix this? I would happily change the "host" network mode if possible.


Solution

  • It turns out the bug is caused by the way Python 3.9 implements Unix abstract socket addresses. As a result, the default socket address that multiprocessing.Manager() picks on Unix machines is not very random. As the OP noted, containers running on AWS Batch share the host's network, so this lack of randomness makes it very likely (or even guaranteed) that socket addresses will conflict across containers on the same machine. The suggested fix is to set multiprocessing.util.abstract_sockets_supported = False before initializing your Manager. The default address then falls back to a randomly named temporary path, so the addresses should no longer conflict.

    So just add:

    import multiprocessing
    import multiprocessing.util
    ...
    multiprocessing.util.abstract_sockets_supported = False
    mgr = multiprocessing.Manager()
    ...
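To see what the flag changes, here is a minimal sketch that prints the default address multiprocessing would choose with and without abstract sockets. It assumes Python 3.9+ and relies on multiprocessing.connection.arbitrary_address, an undocumented internal helper, so treat it as a diagnostic aid rather than a supported API. The function name default_unix_address is made up for this illustration.

```python
import multiprocessing.connection as mp_connection
import multiprocessing.util as mp_util


def default_unix_address(use_abstract: bool) -> str:
    """Return the AF_UNIX address multiprocessing would pick by default.

    Hypothetical helper for illustration only. It calls the internal,
    undocumented arbitrary_address() function (assumes Python 3.9+).
    """
    original = mp_util.abstract_sockets_supported
    # Abstract sockets can only be used where the platform supports them.
    mp_util.abstract_sockets_supported = use_abstract and original
    try:
        return mp_connection.arbitrary_address("AF_UNIX")
    finally:
        # Restore the module-level flag so the check has no side effects.
        mp_util.abstract_sockets_supported = original


if __name__ == "__main__":
    # With abstract sockets (where supported) the address starts with a NUL
    # byte; without them it is a temporary file path.
    print("abstract enabled: ", repr(default_unix_address(use_abstract=True)))
    print("abstract disabled:", repr(default_unix_address(use_abstract=False)))
```

Running this inside two containers on the same host should show whether the abstract addresses collide in your setup. The file-path form lives in each container's own filesystem, which is why disabling abstract sockets avoids the clash.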