macos keras deep-learning tensorflow2.0 apple-m1

Why is the GPU 3.5 times slower than the CPU on an Apple M1 Mac?


I was building a simple network with Keras on an M1 MacBook Air, and I installed the officially recommended tensorflow-metal plugin, expecting faster training and prediction. However, prediction with the GPU was 3.5 times slower than with the CPU, which confused me. Here is my code, followed by the output with and without the GPU enabled:

import time

import numpy as np
from tensorflow.keras.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers


class CNNModel(object):
    def __init__(self, input_shape=(29, 1), num_classes=6, model_path=None):
        self.model = keras.Sequential(
            [
                keras.Input(input_shape),
                layers.Conv1D(16, kernel_size=3, activation="relu"),
                layers.MaxPooling1D(pool_size=3),
                layers.Conv1D(32, kernel_size=3, activation="relu"),
                layers.MaxPooling1D(pool_size=3),
                layers.Flatten(),
                layers.Dropout(0.5),
                layers.Dense(32, activation="sigmoid"),
                layers.Dense(num_classes, activation='softmax')
            ]
        )
        self.model.compile(loss="categorical_crossentropy", optimizer='adam', metrics=['accuracy'])
        if model_path is not None:
            self.model.load_weights(model_path)

    def predict(self, x):
        preds = self.model.predict(x)
        preds = np.argmax(preds, axis=1)
        return preds

    def fit(self, x, y, model_save_path, batch_size=64, epochs=30):
        history = self.model.fit(x, y, batch_size=batch_size, epochs=epochs, validation_split=0.2,
                                 callbacks=[ModelCheckpoint(filepath=model_save_path, save_weights_only=True,
                                                            monitor='val_accuracy', mode='max', save_best_only=True)])


if __name__ == '__main__':
    model_path = "test.h5"
    sample_size = 20000
    data_x, data_y = np.random.random((sample_size, 29)), np.random.randint(0, 12, size=(sample_size, 1))
    class_num = np.unique(data_y).shape[0]
    data_y = keras.utils.to_categorical(data_y, class_num)
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(data_x, data_y, test_size=0.2)
    model = CNNModel(input_shape=(Xtrain.shape[1], 1), num_classes=class_num)
    model.fit(Xtrain, Ytrain, batch_size=512, epochs=10, model_save_path=model_path)
    model = CNNModel(input_shape=(Xtrain.shape[1], 1), num_classes=class_num, model_path=model_path)
    since = time.time()
    preds = model.predict(Xtest)
    end = time.time()
    print(f'Predict {Xtest.shape[0]} samples in {end - since : .9f}s, {(end - since) / Xtest.shape[0]: .9f}s on avg')

and I got the following output when using the GPU:

Metal device set to: Apple M1

systemMemory: 8.00 GB
maxCacheSize: 2.67 GB

2022-01-10 21:07:47.974952: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-01-10 21:07:47.975053: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
2022-01-10 21:07:48.039236: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Epoch 1/10
2022-01-10 21:07:48.206631: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
23/25 [==========================>...] - ETA: 0s - loss: 2.5483 - accuracy: 0.0828
2022-01-10 21:07:48.674379: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
25/25 [==============================] - 1s 18ms/step - loss: 2.5446 - accuracy: 0.0839 - val_loss: 2.4955 - val_accuracy: 0.0850
Epoch 2/10
25/25 [==============================] - 0s 15ms/step - loss: 2.4923 - accuracy: 0.0870 - val_loss: 2.4852 - val_accuracy: 0.0875
Epoch 3/10
25/25 [==============================] - 0s 13ms/step - loss: 2.4864 - accuracy: 0.0863 - val_loss: 2.4851 - val_accuracy: 0.0866
Epoch 4/10
25/25 [==============================] - 0s 13ms/step - loss: 2.4866 - accuracy: 0.0841 - val_loss: 2.4851 - val_accuracy: 0.0862
Epoch 5/10
25/25 [==============================] - 0s 14ms/step - loss: 2.4863 - accuracy: 0.0826 - val_loss: 2.4849 - val_accuracy: 0.0869
Epoch 6/10
25/25 [==============================] - 0s 13ms/step - loss: 2.4855 - accuracy: 0.0909 - val_loss: 2.4850 - val_accuracy: 0.0800
Epoch 7/10
25/25 [==============================] - 0s 13ms/step - loss: 2.4861 - accuracy: 0.0843 - val_loss: 2.4848 - val_accuracy: 0.0884
Epoch 8/10
25/25 [==============================] - 0s 13ms/step - loss: 2.4852 - accuracy: 0.0848 - val_loss: 2.4852 - val_accuracy: 0.0803
Epoch 9/10
25/25 [==============================] - 0s 13ms/step - loss: 2.4848 - accuracy: 0.0880 - val_loss: 2.4846 - val_accuracy: 0.0866
Epoch 10/10
25/25 [==============================] - 0s 13ms/step - loss: 2.4846 - accuracy: 0.0871 - val_loss: 2.4851 - val_accuracy: 0.0875
2022-01-10 21:07:51.840891: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
Predict 4000 samples in 0.259644985s, 0.000064911s on avg

and I got this after uninstalling tensorflow-metal with python -m pip uninstall tensorflow-metal:

Epoch 1/10
25/25 [==============================] - 0s 6ms/step - loss: 2.6182 - accuracy: 0.0824 - val_loss: 2.5252 - val_accuracy: 0.0878
Epoch 2/10
25/25 [==============================] - 0s 3ms/step - loss: 2.5025 - accuracy: 0.0863 - val_loss: 2.4898 - val_accuracy: 0.0791
Epoch 3/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4901 - accuracy: 0.0848 - val_loss: 2.4873 - val_accuracy: 0.0766
Epoch 4/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4894 - accuracy: 0.0844 - val_loss: 2.4865 - val_accuracy: 0.0847
Epoch 5/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4891 - accuracy: 0.0802 - val_loss: 2.4869 - val_accuracy: 0.0797
Epoch 6/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4876 - accuracy: 0.0811 - val_loss: 2.4876 - val_accuracy: 0.0828
Epoch 7/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4866 - accuracy: 0.0847 - val_loss: 2.4873 - val_accuracy: 0.0822
Epoch 8/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4867 - accuracy: 0.0841 - val_loss: 2.4867 - val_accuracy: 0.0838
Epoch 9/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4870 - accuracy: 0.0860 - val_loss: 2.4867 - val_accuracy: 0.0787
Epoch 10/10
25/25 [==============================] - 0s 3ms/step - loss: 2.4860 - accuracy: 0.0883 - val_loss: 2.4870 - val_accuracy: 0.0744
Predict 4000 samples in 0.073775768s, 0.000018444s on avg


Solution

  • Last week I ran into the same issue, and it confused me a lot too. In my case CPU training took ~7 s and GPU training ~100 s, so the GPU was about 14 times slower! That was on a simple ANN; on a CNN I found the GPU to be about 20% faster than the CPU.

    I think it depends on your input size. An individual GPU core is much slower than a CPU core, but the main advantage of a GPU is that it can run thousands of threads simultaneously. On the CPU you're limited by the number of cores, and even though the M1 has 8 cores, only 4 of them are high-performance cores.

    So if your training batches are small enough, you won't see any benefit from the GPU, because many of its threads will sit idle; they can't each process a separate batch due to the GPU architecture. I suggest you benchmark GPU and CPU performance on a small number of epochs and then choose the faster device (see the sketch at the end of this answer).

    You don't need to uninstall tensorflow-metal to use only the CPU. You can simply call

    tf.config.set_visible_devices([], 'GPU')
    

    before building and compiling the model. This call removes all GPUs from TensorFlow's list of visible devices, so training and prediction will run on the CPU only.
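
    As a rough way to check both points above, here is a minimal timing sketch. It is my own illustration, not part of the question's code: the USE_GPU flag and the slimmed-down model are assumptions for the example, while the set_visible_devices call is exactly the one described here. Run it once with USE_GPU = True and once with USE_GPU = False and compare the printed times.

    import time

    import numpy as np
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers

    USE_GPU = False  # hypothetical flag: flip to True to keep the Metal GPU visible
    if not USE_GPU:
        # Hide every GPU before any model is built; with tensorflow-metal still
        # installed, all ops are then placed on the CPU.
        tf.config.set_visible_devices([], 'GPU')

    # A small Conv1D model in the spirit of the question's CNN, just for timing.
    model = keras.Sequential([
        keras.Input((29, 1)),
        layers.Conv1D(16, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=3),
        layers.Flatten(),
        layers.Dense(12, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

    # Synthetic data with the same shapes as in the question.
    x = np.random.random((20000, 29, 1)).astype("float32")
    y = keras.utils.to_categorical(np.random.randint(0, 12, size=(20000,)), 12)

    # Time a couple of epochs and one prediction pass on the currently visible device.
    since = time.time()
    model.fit(x, y, batch_size=512, epochs=2, verbose=0)
    print(f"fit:     {time.time() - since:.3f}s")

    since = time.time()
    model.predict(x, batch_size=512, verbose=0)
    print(f"predict: {time.time() - since:.3f}s")

    With a model and batch size this small, the CPU run will often come out ahead, which matches the numbers in the question; the GPU's parallelism would only be expected to pay off with larger batches and larger models.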