tensorflow keras mpi hdf5 mpi4py

How to train tensorflow.keras models in parallel using GPUs? TensorFlow version 2.5.0


I have the following code, which runs a custom model defined in a different module and takes several parameters as input (learning rate, convolution kernel size, etc.).

custom_model is a function that compiles a tensorflow.keras.models.Model and returns the model.

I load both arrays from an HDF5 file, but the datasets are quite large, on the order of 10 GB.

Normally I run this in JupyterLab with no problems, and the model does not use up all of the GPU's resources. At the end I save the weights for the different parameter combinations.

Now my question is:

How do I turn this into a script and run it in parallel for different values of k1 and k2? I guess something like a bash loop would do, but I want to avoid re-reading the dataset for every run. I am using Windows 10 as my operating system.

import tensorflow as tf

# Let GPU memory grow on demand instead of reserving it all up front
physical_devices = tf.config.list_physical_devices('GPU')
for gpu_instance in physical_devices:
    tf.config.experimental.set_memory_growth(gpu_instance, True)

import h5py

from model_custom import custom_model

winx = 100
winz = 10
k1 = 9    # first convolution kernel size
k2 = 5    # second convolution kernel size

# Read the full input/target arrays into memory (~10 GB)
with h5py.File('MYFILE', 'r') as hf:
    LOW = hf['LOW'][:]
    HIGH = hf['HIGH'][:]

# Build, train, and save one model on the second GPU
with tf.device("/gpu:1"):
    mymodel = custom_model(winx, winz, lrate=0.001, usebias=True, kz1=k1, kz2=k2)
    myhistory = mymodel.fit(LOW, HIGH, batch_size=1, epochs=1)
    mymodel.save_weights('zkernel_{}_kz1_{}_kz2_{}.hdf5'.format(winz, k1, k2))
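
Loading once and looping over the kernel sizes in a single script would avoid the re-reading, but it is sequential; what I want is to train the combinations in parallel. A minimal sketch of that sequential baseline, assuming the same custom_model interface (the kernel-size values here are just examples):

import itertools

import h5py
import tensorflow as tf

from model_custom import custom_model

winx = 100
winz = 10

# Read the arrays once and reuse them for every (k1, k2) combination
with h5py.File('MYFILE', 'r') as hf:
    LOW = hf['LOW'][:]
    HIGH = hf['HIGH'][:]

# Sequential sweep: no re-reading, but also no parallelism
for k1, k2 in itertools.product([9, 8], range(1, 6)):
    with tf.device("/gpu:1"):
        mymodel = custom_model(winx, winz, lrate=0.001, usebias=True, kz1=k1, kz2=k2)
        mymodel.fit(LOW, HIGH, batch_size=1, epochs=1)
        mymodel.save_weights('zkernel_{}_kz1_{}_kz2_{}.hdf5'.format(winz, k1, k2))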


Solution

  • I found that this solution works fine for me. It enables parallel model training on the GPUs using MPI with mpi4py. The only issue arises when I load big files and run many processes together, so that the number of processes times the data each one holds exceeds my RAM capacity (see the shared-memory sketch after the run command below).

    from mpi4py import MPI
    import tensorflow as tf

    # Let GPU memory grow on demand so several processes can share a GPU
    physical_devices = tf.config.list_physical_devices('GPU')
    for gpu_instance in physical_devices:
        tf.config.experimental.set_memory_growth(gpu_instance, True)

    import h5py
    from model_custom import custom_model

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    winx = 100
    winy = 100
    winz = 10

    # Rank 10 is the only process that reads the file; the arrays are then
    # broadcast to every other rank, so the file is read exactly once.
    if rank == 10:
        with h5py.File('mifile.hdf5', 'r') as hf:
            LOW = hf['LOW'][:]
            HIGH = hf['HIGH'][:]
    else:
        HIGH = None
        LOW = None
    HIGH = comm.bcast(HIGH, root=10)
    LOW = comm.bcast(LOW, root=10)

    # Ranks 0-4 train the kz1=9 models on the second GPU
    if rank < 5:
        with tf.device("/gpu:1"):
            k = 9
            q = rank + 1
            mymodel1 = custom_model(winx, winz, lrate=0.001, usebias=True, kz1=k, kz2=q)
            mymodel1._name = '{}_{}_{}'.format(winz, k, q)
            myhistory1 = mymodel1.fit(LOW, HIGH, batch_size=1, epochs=1)
            mymodel1.save_weights(mymodel1.name + 'winz_{}_k_{}_q_{}.hdf5'.format(winz, k, q))

    # Ranks 5-9 train the kz1=8 models on the third GPU
    elif 5 <= rank < 10:
        with tf.device("/gpu:2"):
            k = 8
            q = rank + 1 - 5
            mymodel2 = custom_model(winx, winz, lrate=0.001, usebias=True, kz1=k, kz2=q)
            mymodel2._name = '{}_{}_{}'.format(winz, k, q)
            myhistory2 = mymodel2.fit(LOW, HIGH, batch_size=1, epochs=1)
            mymodel2.save_weights(mymodel2.name + 'winz_{}_k_{}_q_{}.hdf5'.format(winz, k, q))


    Then I save this as a Python script named mycode.py and run it from the console. Eleven processes are needed: ranks 0-9 each train one model, and rank 10 loads the data and broadcasts it:

    mpiexec -n 11 python ./mycode.py
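
    To soften the RAM issue mentioned above, one option on a single machine is an MPI shared-memory window, so the node holds a single copy of each array instead of one copy per process (comm.bcast pickles the arrays, so every rank otherwise ends up with its own full copy). A minimal sketch, assuming all ranks run on one node and showing only the LOW dataset (HIGH would be shared the same way):

    from mpi4py import MPI
    import numpy as np
    import h5py

    comm = MPI.COMM_WORLD
    # Sub-communicator of the ranks that can share this node's memory
    node = comm.Split_type(MPI.COMM_TYPE_SHARED)
    noderank = node.Get_rank()

    # Every rank needs the shape/dtype to map the shared buffer
    with h5py.File('mifile.hdf5', 'r') as hf:
        shape = hf['LOW'].shape
        dtype = hf['LOW'].dtype
    nbytes = int(np.prod(shape)) * dtype.itemsize

    # Rank 0 of the node allocates one shared block; the others attach to it
    win = MPI.Win.Allocate_shared(nbytes if noderank == 0 else 0,
                                  dtype.itemsize, comm=node)
    buf, itemsize = win.Shared_query(0)
    LOW = np.ndarray(buffer=buf, dtype=dtype, shape=shape)

    if noderank == 0:
        # A single process reads the file and fills the shared array
        with h5py.File('mifile.hdf5', 'r') as hf:
            hf['LOW'].read_direct(LOW)
    node.Barrier()   # wait until the data is in place before training

    Keras may still make its own working copies during training, but the raw arrays are at least stored only once per node.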