keras · deep-learning · google-colaboratory · google-cloud-tpu · tpu

Residual neural network model runs very slowly on Google Colab TPU hardware?


I've built a residual neural network in Keras on Google Colab for the CIFAR-10 dataset, but it runs very slowly on the TPU hardware.

I have another, regular convolutional neural network that runs fine on Google Colab. That model uses the Keras Sequential API, while the residual network uses the Functional API; I'm not sure if that is the issue. I've already tried changing the batch size, and that did not help. The link to my program is below.

https://colab.research.google.com/github/valentinocc/Keras_cifar10/blob/master/keras_rnn_cifar10.ipynb#scrollTo=7Jc51Dbac2MC

I expect each epoch to finish in under one minute (usually around 10 seconds at most), but each mini-batch seems to take a full minute on its own to complete (and there are many mini-batches per epoch).


Solution

  • It seems like your model isn't running on the TPU hardware at all, but on the CPU. To run training/prediction on a TPU with a TensorFlow Keras model, you need to create a TPUStrategy and build and compile your model within that strategy's scope:

    import tensorflow as tf

    # Locate the Colab TPU, connect to it, and initialize it
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.experimental.TPUStrategy(resolver)

    # Build and compile the model inside the strategy scope so that its
    # weights and training step are placed on the TPU
    with strategy.scope():
      model = create_model()
      model.compile(optimizer='adam',
                    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                    metrics=['sparse_categorical_accuracy'])

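    As a minimal sketch of what training then looks like (assuming the create_model() helper above builds your residual network, and that the model variable is the one compiled inside the scope), you can feed the model a tf.data pipeline. On a TPU, a large fixed-size global batch with drop_remainder=True helps keep all cores busy:

    # CIFAR-10 input pipeline; drop_remainder=True produces fixed-shape
    # batches, which TPU compilation requires, and a large global batch
    # (e.g. 128 per core x 8 cores = 1024) keeps the TPU utilized.
    (x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
    train_ds = (
        tf.data.Dataset.from_tensor_slices(
            (x_train.astype('float32') / 255.0, y_train.astype('int32')))
        .shuffle(10000)
        .batch(1024, drop_remainder=True)
        .prefetch(tf.data.experimental.AUTOTUNE)
    )

    # `model` is the model compiled inside strategy.scope() above
    model.fit(train_ds, epochs=10)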
    For more info, please follow the TensorFlow TPU guide (https://www.tensorflow.org/guide/tpu).
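
    To verify that the runtime is actually attached to a TPU (rather than silently falling back to the CPU), you can list the visible TPU devices right after the setup code above:

    # On a standard Colab TPU runtime this prints 8 logical TPU cores;
    # an empty list means the TPU was never connected.
    print('TPU devices:', tf.config.list_logical_devices('TPU'))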