kubernetes · gpu · cluster-computing · slurm · docker-datacenter

Setting up a multi-user job scheduler for data science / ML tasks


Background

Recently my lab invested in GPU computation infrastructure. More specifically: two Titan V cards installed in a standard server machine. Currently the machine runs a barely configured Windows Server. Everyone in my lab can log in and do whatever they want. From time to time the machine becomes completely unusable for others because someone accidentally occupied all available memory.

Since ML is growing here, I am looking for a better way to make use of our infrastructure.

Requirements

What I tried so far

I have a small test setup (a consumer PC with a GTX 1070) for experimenting. My internet research pointed me to SLURM and Kubernetes.

First of all, I like the idea of a cluster management system, since it offers the option to extend the infrastructure in the future.

SLURM was fairly easy to set up, but I was not able to configure something like remote submission or time-slice scheduling.
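For context, remote submission with SLURM usually just means SSHing to a login node and calling `sbatch` with a batch script. The following is only a sketch under assumptions not stated in the post: the script name, partition, and `train.py` are placeholders, and `--gres=gpu:1` only works if GPUs are declared as a GRES resource in `slurm.conf` and `gres.conf`:

```
#!/bin/bash
#SBATCH --job-name=train-model   # name shown in the queue
#SBATCH --gres=gpu:1             # request one GPU (requires GRES configured on the node)
#SBATCH --mem=16G                # cap memory so one job cannot starve the machine
#SBATCH --time=04:00:00         # wall-clock limit, so jobs eventually free the GPU

python train.py
```

A user would then submit it from their own machine with something like `ssh cluster sbatch train.sh`. The memory and time limits are what prevents the "someone occupied all available memory" situation described above.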

In the meantime I also tried to work with Kubernetes. To me it offers far more interesting features, above all containerization. However, all these features make it more complicated to set up and understand. And again, I was not able to build something like remote submission.

My question

Has anyone faced the same problem and can share their solution? I have the feeling that Kubernetes is better prepared for the future.

If you need more information, let me know.

Thanks, Tim!


Solution

  • As far as I know, Kubernetes does not support sharing a single GPU between containers, which is what was asked here.

    There is an ongoing discussion about this: Is sharing GPU to multiple containers feasible? #52757

    I was able to find a Docker image with examples that "support share GPUs unofficially", available here: cvaldit/nvidia-k8s-device-plugin.

    It can be used in the following way:

        apiVersion: v1
        kind: Pod
        metadata:
          name: gpu-pod
        spec:
          containers:
            - name: cuda-container
              image: nvidia/cuda:9.0-devel
              resources:
                limits:
                  nvidia.com/gpu: 2 # requesting 2 GPUs
            - name: digits-container
              image: nvidia/digits:6.0
              resources:
                limits:
                  nvidia.com/gpu: 2 # requesting 2 GPUs

    That would expose 2 GPUs inside each container to run your job in, and also lock those GPUs from further use until the job ends.

    I'm not sure how you would scale this for multiple users, other than by limiting the maximum number of GPUs each job may use.

    You can also read about Schedule GPUs in the Kubernetes documentation; note that this feature is still experimental.
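    For the multi-user aspect, one option (not from the discussion above, just a sketch) is to give each user their own namespace and attach a ResourceQuota to it. Kubernetes quotas can limit extended resources such as `nvidia.com/gpu` via the `requests.` prefix; the namespace name `tim` below is only an example:

    ```
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: gpu-quota
      namespace: tim               # one namespace per user (example name)
    spec:
      hard:
        requests.nvidia.com/gpu: 2 # this user may hold at most 2 GPUs in total
    ```

    Applied with `kubectl apply -f gpu-quota.yaml`, this caps the total number of GPUs a user can occupy across all their pods, rather than only per job.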