tensorflow distributed-computing lstm recurrent-neural-network multiple-gpu

How to speed up the training of an RNN model with multiple GPUs in TensorFlow?


For example, the RNN is a dynamic 3-layer bidirectional LSTM with a hidden vector size of 200 (tf.nn.bidirectional_dynamic_rnn), and I have 4 GPUs to train the model. I saw a post that used data parallelism on subsets of samples within a batch, but that didn't speed up the training process.


Solution

  • You can also try model parallelism. One way to do this is to make a cell wrapper like this, which will create cells on a specific device:

    class DeviceCellWrapper(tf.nn.rnn_cell.RNNCell):
      """Wraps an RNN cell so that all of its ops are pinned to one device."""
      def __init__(self, cell, device):
        self._cell = cell
        self._device = device
    
      @property
      def state_size(self):
        return self._cell.state_size
    
      @property
      def output_size(self):
        return self._cell.output_size
    
      def __call__(self, inputs, state, scope=None):
        # Every op the wrapped cell creates is placed on the chosen device.
        with tf.device(self._device):
          return self._cell(inputs, state, scope)
    

    Then place each individual cell (or each stacked layer) onto its own dedicated GPU:

    cell_fw = DeviceCellWrapper(cell=tf.nn.rnn_cell.LSTMCell(num_units=n_neurons, state_is_tuple=False), device='/gpu:0')
    cell_bw = DeviceCellWrapper(cell=tf.nn.rnn_cell.LSTMCell(num_units=n_neurons, state_is_tuple=False), device='/gpu:1')
    outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, X, dtype=tf.float32)
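For the question's 3-layer bidirectional LSTM on 4 GPUs, each layer's forward and backward cells can each be wrapped with a DeviceCellWrapper and round-robined across the available devices. A framework-free sketch of one such assignment (`assign_devices` is a hypothetical helper, not a TensorFlow API):

```python
def assign_devices(n_layers, n_gpus):
    """Map (layer, direction) pairs to '/gpu:N' strings, round-robin."""
    assignment = {}
    gpu = 0
    for layer in range(n_layers):
        for direction in ('fw', 'bw'):
            # Each cell gets the next GPU, wrapping around when we run out.
            assignment[(layer, direction)] = '/gpu:%d' % (gpu % n_gpus)
            gpu += 1
    return assignment

devices = assign_devices(n_layers=3, n_gpus=4)
# Layer 0: fw -> /gpu:0, bw -> /gpu:1
# Layer 1: fw -> /gpu:2, bw -> /gpu:3
# Layer 2: fw -> /gpu:0, bw -> /gpu:1
```

Each resulting device string would then be passed as the `device` argument of a DeviceCellWrapper around the corresponding LSTMCell. Note that because each timestep still depends on the previous one, this mainly overlaps computation between the forward and backward directions and between layers, rather than giving a linear speedup.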