Tags: python, tensorflow, ray

Ray: How to run many actors on one GPU?


I have only one GPU, and I want to run many actors on it. Here's what I do using Ray, following https://ray.readthedocs.io/en/latest/actors.html

  1. First, define the network on the GPU:
import os
import atexit

import ray
import tensorflow as tf

class Network():
    def __init__(self, ***some args here***):
        self._graph = tf.Graph()
        os.environ['CUDA_VISIBLE_DEVICES'] = ','.join([str(i) for i in ray.get_gpu_ids()])
        with self._graph.as_default():
            with tf.device('/gpu:0'):
                # network, loss, and optimizer are defined here

        sess_config = tf.ConfigProto(allow_soft_placement=True)
        sess_config.gpu_options.allow_growth=True
        self.sess = tf.Session(graph=self._graph, config=sess_config)
        self.sess.run(tf.global_variables_initializer())
        atexit.register(self.sess.close)

        self.variables = ray.experimental.TensorFlowVariables(self.loss, self.sess)
  2. Then define the worker class:
@ray.remote(num_gpus=1)
class Worker(Network):
    # do something
  3. Define the learner class:
@ray.remote(num_gpus=1)
class Learner(Network):
    # do something
  4. Finally, the train function:
def train():
    ray.init(num_gpus=1)
    learner = Learner.remote(...)
    workers = [Worker.remote(...) for i in range(10)]
    # do something

This setup works fine when I don't try to run it on the GPU, that is, when I remove all the `with tf.device('/gpu:0')` blocks and the `(num_gpus=1)` arguments. The trouble arises when I keep them: it seems that only the learner is created, and none of the workers is constructed. What should I do to make this work?


Solution

  • When you define an actor class using the decorator @ray.remote(num_gpus=1), you are saying that any actor created from this class must have one GPU reserved for it for the duration of the actor's lifetime. Since you have only one GPU, you will only be able to create one such actor.

    If you want multiple actors to share a single GPU, then you need to specify that each actor requires less than one GPU. For example, to share one GPU among four actors, have each actor require one quarter of a GPU. This can be done by declaring the actor class with

    @ray.remote(num_gpus=0.25)
    

    In addition, you need to make sure that each actor actually respects the limits you are placing on it. For example, if you declare an actor with @ray.remote(num_gpus=0.25), then you should also make sure that TensorFlow uses at most one quarter of the GPU memory. See the answers to How to prevent tensorflow from allocating the totality of a GPU memory? for example.
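
    Putting both pieces together, a minimal sketch (assuming TF 1.x and the Ray API version used in the question; the class and variable names are illustrative, not from the original code):

    ```python
    import ray
    import tensorflow as tf

    NUM_ACTORS = 4
    GPU_FRACTION = 1.0 / NUM_ACTORS  # each actor reserves 1/4 of the GPU

    @ray.remote(num_gpus=GPU_FRACTION)
    class Worker:
        def __init__(self):
            # Cap TensorFlow's allocation so four sessions fit on one GPU.
            config = tf.ConfigProto(allow_soft_placement=True)
            config.gpu_options.per_process_gpu_memory_fraction = GPU_FRACTION
            self.sess = tf.Session(config=config)

        def gpu_ids(self):
            # All four actors will report the same physical GPU.
            return ray.get_gpu_ids()

    ray.init(num_gpus=1)
    workers = [Worker.remote() for _ in range(NUM_ACTORS)]
    ```

    Note that Ray's fractional resource accounting is bookkeeping only: it decides how many actors get scheduled, but it is the `per_process_gpu_memory_fraction` setting (or `allow_growth`) that actually keeps each TensorFlow session from grabbing the whole GPU.
    
    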