Tags: docker, dask, dask-distributed, dask-kubernetes

In dask, what is the easiest way to run a task that itself runs a docker container?


The following code maps a function over an iterable. The function being applied to each element runs a docker container in order to compute its return value:

import subprocess

def task(arg):
    return subprocess.check_output(
        ["docker", "run", "ubuntu", "bash", "-c", f"echo 'result_{arg}'"]
    )

args = [1, 2, 3]
for result in map(task, args):
    print(result.decode("utf-8").strip())

Output:

result_1
result_2
result_3

What is the easiest way to parallelize this computation over cloud compute resources in dask?

For example, it would be nice if one could do something like the following. Of course, this does not work: the Fargate containers in which the python code executes run the default dask image, so they cannot spawn a docker container themselves. (I am not sure whether a solution exists in this "docker-in-docker" direction.)

import subprocess

from dask.distributed import Client
from dask_cloudprovider import FargateCluster
import dask.bag

def task(arg):
    return subprocess.check_output(
        ["docker", "run", "ubuntu", "bash", "-c", f"echo 'result_{arg}'"]
    )

cluster = FargateCluster(n_workers=1)
client = Client(cluster)
args = [1, 2, 3]
for result in dask.bag.from_sequence(args).map(task).compute():
    print(result)

I am looking for a solution that doesn't involve housing unrelated code in the same docker image. That is, I want the docker image used by my task to be an arbitrary third-party image that I do not have to alter by adding python/dask dependencies. I think that rules out solutions based on changing the image used by a worker node under dask_cloudprovider.FargateCluster/ECSCluster, since that image has to contain the python/dask dependencies.


Solution

  • Pulling a container image onto a kubernetes node has significant overhead, which can really only be justified if the task is long-running (minutes or hours). dask is oriented towards low-overhead, python-based tasks.

    In my opinion, dask is not the right tool for executing tasks that are themselves container images. Several other technologies better support container-based tasks and workflows (Airflow's KubernetesExecutor or Argo Workflows, for example).

    What you might consider instead is using dask_kubernetes inside a container-based task to spin up an ephemeral cluster that executes the computational work required.
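To make the Argo Workflows suggestion above more concrete, here is a minimal sketch that builds a workflow manifest fanning the question's `echo` task out over `[1, 2, 3]`, with each item running in its own unmodified `ubuntu` container. The function name `build_workflow`, the template names `fan-out` and `task`, and the `docker-task-` name prefix are all hypothetical choices for illustration; the manifest layout follows Argo's `withItems` loop pattern, and it assumes a cluster with Argo Workflows already installed.

```python
import json


def build_workflow(args, image="ubuntu"):
    """Return an Argo Workflow manifest (as a dict) that runs one container per arg."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        # generateName asks the cluster to append a unique suffix per submission.
        "metadata": {"generateName": "docker-task-"},  # hypothetical prefix
        "spec": {
            "entrypoint": "fan-out",
            "templates": [
                {
                    # One step, expanded into parallel runs by withItems.
                    "name": "fan-out",
                    "steps": [[
                        {
                            "name": "run-task",
                            "template": "task",
                            "arguments": {
                                "parameters": [{"name": "arg", "value": "{{item}}"}]
                            },
                            "withItems": list(args),
                        }
                    ]],
                },
                {
                    # The third-party image is used as-is: no python or dask inside.
                    "name": "task",
                    "inputs": {"parameters": [{"name": "arg"}]},
                    "container": {
                        "image": image,
                        "command": ["bash", "-c"],
                        "args": ["echo 'result_{{inputs.parameters.arg}}'"],
                    },
                },
            ],
        },
    }


if __name__ == "__main__":
    # JSON is also valid YAML, so the output can be saved as a manifest file.
    print(json.dumps(build_workflow([1, 2, 3]), indent=2))
```

Saving the printed output to a file and submitting it (for instance with `argo submit`) should start three containers. Note the trade-off compared with the dask version in the question: each result ends up in the corresponding pod's logs or output parameters rather than being returned to the calling python process.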