When using multiple GPUs to run inference on a model (e.g. via the call method, model(inputs)) and calculate its gradients, the machine only uses one GPU, leaving the rest idle.
For example, in the code snippet below:
import tensorflow as tf
import numpy as np
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Make the tf-data
path_filename_records = 'your_path_to_records'
bs = 128

dataset = tf.data.TFRecordDataset(path_filename_records)
dataset = (dataset
           .map(parse_record, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(bs)
           .prefetch(tf.data.experimental.AUTOTUNE)
           )

# Load model trained using MirroredStrategy
path_to_resnet = 'your_path_to_resnet'
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    resnet50 = tf.keras.models.load_model(path_to_resnet)

for pre_images, true_label in dataset:
    with tf.GradientTape() as tape:
        tape.watch(pre_images)
        outputs = resnet50(pre_images)
    grads = tape.gradient(outputs, pre_images)
Only one GPU is used. You can profile the behavior of the GPUs with nvidia-smi. I don't know if it is supposed to be like this, with both model(inputs) and tape.gradient lacking multi-GPU support. But if it is, it's a big problem, because if you have a large dataset and need to calculate the gradients with respect to the inputs (e.g. for interpretability purposes), it might take days with one GPU. Another thing I tried was model.predict(), but that isn't possible with tf.GradientTape.
What I've tried so far and didn't work

strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1'])
strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()), as @Kaveh suggested.

How do I know that only one GPU is working?
I ran watch -n 1 nvidia-smi in the terminal and observed that only one GPU is at 100% while the rest sit at 0%.
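As an extra sanity check (not part of the original snippet), you can also confirm from inside TensorFlow that both GPUs are visible at all; if this prints a single device, the issue is visibility rather than distribution:

import tensorflow as tf

# Should list two PhysicalDevice entries when CUDA_VISIBLE_DEVICES="0,1" is honored.
print(tf.config.list_physical_devices('GPU'))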
Working Example
You can find below a working example with a CNN trained on the dogs_vs_cats dataset. You won't need to download the dataset manually (I used the tfds version) or to train a model.
Notebook: Working Example.ipynb
Saved Model:
Any code outside of mirrored_strategy.run() is supposed to run on a single GPU (probably the first one, GPU:0). Also, since you want the gradients returned from the replicas, mirrored_strategy.gather() is needed as well.
Besides these, a distributed dataset must be created with mirrored_strategy.experimental_distribute_dataset. The distributed dataset tries to split each batch of data evenly across the replicas. An example covering these points is included below.
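To see just the batch-splitting behavior in isolation, here is a minimal sketch (the shapes assume 2 replicas; experimental_local_results is used only to peek at the per-replica pieces):

import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()

# A toy dataset: one global batch of 8 rows.
ds = tf.data.Dataset.from_tensor_slices(tf.zeros([8, 4])).batch(8)
dist_ds = mirrored_strategy.experimental_distribute_dataset(ds)

per_replica = next(iter(dist_ds))
# With 2 replicas, each one receives a (4, 4) slice of the (8, 4) batch.
for t in mirrored_strategy.experimental_local_results(per_replica):
    print(t.shape)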
model.fit(), model.predict(), and so on run in a distributed manner automatically simply because they already handle everything mentioned above for you.
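For instance, a minimal sketch (assuming the model is built under the strategy scope): predict distributes on its own, with no strategy.run call needed:

import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.applications.resnet50.ResNet50()

# Keras splits each batch across the replicas internally.
preds = model.predict(tf.data.Dataset.from_tensor_slices(tf.zeros([16, 224, 224, 3])).batch(8))
print(preds.shape)  # (16, 1000)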
Example code:
import tensorflow as tf
import numpy as np

mirrored_strategy = tf.distribute.MirroredStrategy()
print(f'using distribution strategy\nnumber of gpus: {mirrored_strategy.num_replicas_in_sync}')

dataset = tf.data.Dataset.from_tensor_slices(np.random.rand(64, 224, 224, 3)).batch(8)

# create distributed dataset
ds = mirrored_strategy.experimental_distribute_dataset(dataset)

# make variables mirrored
with mirrored_strategy.scope():
    resnet50 = tf.keras.applications.resnet50.ResNet50()

# per-replica step: Jacobian of the first logit w.r.t. the input images
def step_fn(pre_images):
    with tf.GradientTape(watch_accessed_variables=False) as tape:
        tape.watch(pre_images)
        outputs = resnet50(pre_images)[:, 0:1]
    return tf.squeeze(tape.batch_jacobian(outputs, pre_images))

# define distributed step function using strategy.run and strategy.gather
@tf.function
def distributed_step_fn(pre_images):
    per_replica_grads = mirrored_strategy.run(step_fn, args=(pre_images,))
    return mirrored_strategy.gather(per_replica_grads, 0)

# loop over distributed dataset with distributed_step_fn
for result in map(distributed_step_fn, ds):
    print(result.numpy().shape)
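With two GPUs, each replica processes 4 of the 8 images per step, and gather concatenates the per-replica Jacobians back along axis 0, so each iteration should print (8, 224, 224, 3).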