For example, the RNN is a dynamic 3-layer bidirectional LSTM with a hidden vector size of 200 (tf.nn.bidirectional_dynamic_rnn), and I have 4 GPUs to train the model. I saw a post using data parallelism on subsets of samples within a batch, but that didn't speed up training.
You can also try model parallelism. One way to do this is to write a cell wrapper that creates its wrapped cell's ops on a specific device:
import tensorflow as tf

class DeviceCellWrapper(tf.nn.rnn_cell.RNNCell):
    def __init__(self, cell, device):
        self._cell = cell
        self._device = device

    @property
    def state_size(self):
        return self._cell.state_size

    @property
    def output_size(self):
        return self._cell.output_size

    def __call__(self, inputs, state, scope=None):
        # Pin every op this cell creates to the requested device.
        with tf.device(self._device):
            return self._cell(inputs, state, scope)
Then place each individual cell onto a dedicated GPU (here the forward and backward cells go on different devices):

cell_fw = DeviceCellWrapper(cell=tf.nn.rnn_cell.LSTMCell(num_units=n_neurons, state_is_tuple=False), device='/gpu:0')
cell_bw = DeviceCellWrapper(cell=tf.nn.rnn_cell.LSTMCell(num_units=n_neurons, state_is_tuple=False), device='/gpu:1')
outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, X, dtype=tf.float32)
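Since you have 3 layers and 4 GPUs, one possible extension (a sketch, not something I've benchmarked) is to stack per-layer wrapped cells with tf.nn.rnn_cell.MultiRNNCell, spreading the forward stack over some GPUs and the backward stack over the others. The layer-to-device mapping itself is plain Python; the TensorFlow 1.x wiring is shown in comments because it assumes the DeviceCellWrapper class and placeholders (X, n_neurons) from above:

```python
def assign_devices(n_layers, gpus):
    # Round-robin mapping from layer index to a device string,
    # e.g. 3 layers over ['/gpu:0', '/gpu:1'] -> gpu:0, gpu:1, gpu:0.
    return [gpus[i % len(gpus)] for i in range(n_layers)]

# Usage sketch (TensorFlow 1.x, hypothetical wiring, not executed here):
#   fw_cells = [DeviceCellWrapper(tf.nn.rnn_cell.LSTMCell(n_neurons), dev)
#               for dev in assign_devices(3, ['/gpu:0', '/gpu:1'])]
#   bw_cells = [DeviceCellWrapper(tf.nn.rnn_cell.LSTMCell(n_neurons), dev)
#               for dev in assign_devices(3, ['/gpu:2', '/gpu:3'])]
#   multi_fw = tf.nn.rnn_cell.MultiRNNCell(fw_cells)
#   multi_bw = tf.nn.rnn_cell.MultiRNNCell(bw_cells)
#   outputs, states = tf.nn.bidirectional_dynamic_rnn(
#       multi_fw, multi_bw, X, dtype=tf.float32)
```

Note that an RNN's time steps are sequential, so layers on different GPUs only overlap in a pipelined fashion; don't expect a 4x speedup from this alone.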