I am trying to train using multiple GPUs, but the loss is always NaN after a few steps. If I use a single GPU, it's fine. Below is a dummy script that reproduces the NaNs after a few steps. Below the code is the output of the print statements I placed at the top for TensorFlow build information and GPU information.
The TensorFlow version is 2.18.0.
import tensorflow as tf
import numpy as np
print(tf.sysconfig.get_build_info())
# Check if GPUs are available
gpus = tf.config.list_physical_devices('GPU')
print(gpus)
if gpus:
    print(f"Number of GPUs available: {len(gpus)}")
else:
    print("No GPUs found. Training will proceed on CPU.")
# Define the strategy for multi-GPU training
strategy = tf.distribute.MirroredStrategy()
# Dummy dataset
def create_dummy_dataset(samples=50000):
    # Generate dummy input data
    X = np.random.random((samples, 20)).astype(np.float32)
    # Generate dummy labels (binary classification)
    y = np.random.randint(0, 2, (samples, 1)).astype(np.float32)
    return tf.data.Dataset.from_tensor_slices((X, y)).shuffle(samples).batch(32)
dataset = create_dummy_dataset()
# Define the model inside the strategy scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

# Train the model
model.fit(dataset, epochs=10)
OrderedDict({'cpu_compiler': '/usr/lib/llvm-18/bin/clang', 'cuda_compute_capabilities': ['sm_60', 'sm_70', 'sm_80', 'sm_89', 'compute_90'], 'cuda_version': '12.5.1', 'cudnn_version': '9', 'is_cuda_build': True, 'is_rocm_build': False, 'is_tensorrt_build': False})
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
Number of GPUs available: 2
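For anyone reproducing this, here is a minimal sketch of a guard that stops training on the first NaN loss so the failing step is easy to spot. TerminateOnNaN is a built-in Keras callback; model and dataset are reused from the script above, and nan_guard is just a name I picked.

# Stop training as soon as the loss becomes NaN/Inf so the first failing step is visible
nan_guard = tf.keras.callbacks.TerminateOnNaN()
model.fit(dataset, epochs=10, callbacks=[nan_guard])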
Using Keras 3.8.0 (and also 3.7.0) with TensorFlow 2.18 causes this issue. It was fixed after I downgraded Keras to 3.6.0.
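If you want to apply the same fix, pin the Keras version (e.g. pip install "keras==3.6.0" alongside TensorFlow 2.18) and then confirm at runtime which versions are actually loaded. A small sanity check, assuming a standard pip install:

import tensorflow as tf
import keras

# Confirm the interpreter is picking up the downgraded Keras alongside TF 2.18
print(tf.__version__)     # expect 2.18.0
print(keras.__version__)  # expect 3.6.0 after the downgrade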