
Multi-GPU training in TensorFlow results in NaNs


I am trying to train on multiple GPUs, but the loss always becomes NaN after a few steps. Training on a single GPU works fine. The dummy script below reproduces the problem. The output that follows it comes from the print statements at the top of the script, which show the TensorFlow build and GPU information.

The TensorFlow version is 2.18.0.

import tensorflow as tf
import numpy as np

print(tf.sysconfig.get_build_info())

# Check if GPUs are available
gpus = tf.config.list_physical_devices('GPU')
print(gpus)
if gpus:
    print(f"Number of GPUs available: {len(gpus)}")
else:
    print("No GPUs found. Training will proceed on CPU.")

# Define the strategy for multi-GPU training
strategy = tf.distribute.MirroredStrategy()

# Dummy dataset
def create_dummy_dataset(samples=50000):
    # Generate dummy input data 
    X = np.random.random((samples, 20)).astype(np.float32)
    # Generate dummy labels (binary classification)
    y = np.random.randint(0, 2, (samples, 1)).astype(np.float32)
    return tf.data.Dataset.from_tensor_slices((X, y)).shuffle(samples).batch(32)

dataset = create_dummy_dataset()

# Define the model inside the strategy scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

# Train the model
model.fit(dataset, epochs=10)

Output:

OrderedDict({'cpu_compiler': '/usr/lib/llvm-18/bin/clang', 'cuda_compute_capabilities': ['sm_60', 'sm_70', 'sm_80', 'sm_89', 'compute_90'], 'cuda_version': '12.5.1', 'cudnn_version': '9', 'is_cuda_build': True, 'is_rocm_build': False, 'is_tensorrt_build': False})
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')] 
Number of GPUs available: 2 

Solution

  • The issue occurs when TensorFlow 2.18 is combined with Keras 3.8.0, and also with Keras 3.7.0. Downgrading Keras to 3.6.0 fixed it for me.
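
To check whether an environment has one of the Keras versions mentioned above, something like the following sketch may help. It uses only the standard library. The helper names (`parse_version`, `is_reported_affected`, `installed_keras_version`) and the `REPORTED_AFFECTED` set are hypothetical and made up for illustration. The set contains only the two versions reported here, so a version that is not flagged is untested, not confirmed safe.

```python
from importlib import metadata

# Keras versions reported above as producing NaNs with TensorFlow 2.18
# under MirroredStrategy. This is an assumption based on that report only;
# versions not listed here are untested, not known to be good.
REPORTED_AFFECTED = {(3, 7, 0), (3, 8, 0)}


def parse_version(version):
    """Turn '3.8.0' into (3, 8, 0), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    # Pad short versions such as '3.8' to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_reported_affected(keras_version):
    """Return True if the version matches one of the reported releases."""
    return parse_version(keras_version) in REPORTED_AFFECTED


def installed_keras_version():
    """Return the installed Keras version string, or None if not installed."""
    try:
        return metadata.version("keras")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_keras_version()
    if version is None:
        print("Keras is not installed in this environment.")
    elif is_reported_affected(version):
        print(f"Keras {version} matches a version reported to produce NaNs.")
    else:
        print(f"Keras {version} is not one of the reported versions.")
```

If the check flags the installed version, the downgrade described above corresponds to `pip install keras==3.6.0`.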