Tags: python, python-3.x, mxnet, cudnn

mxnet (gluon): cpu used when gpu(0) context selected


EDIT 02/2018: After writing my own code with the data stored locally and less clunky accuracy metric calculations, I saw a significant speed-up. The GPU also comfortably beats the CPU in any CNN I have tried building in mxnet, even just on MNIST. I believe my issue was linked to the tutorial code and I no longer consider this a real problem.
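
For context, a leaner accuracy check (roughly what I mean by "less clunky") can be written with plain NDArray ops so that only the final sums leave the GPU. This is just a sketch, not the exact code I ended up with, and it assumes the ctx, net and DataLoader defined in the script further down:

#SKETCH ONLY: lightweight accuracy check using plain NDArray ops
from mxnet import nd

def simple_accuracy(data_iterator, net, ctx):
    correct, total = 0.0, 0
    for data, label in data_iterator:
        data = data.as_in_context(ctx).reshape((-1, 784))
        label = label.as_in_context(ctx)
        predictions = nd.argmax(net(data), axis = 1) #predicted digit per sample
        correct += nd.sum(predictions == label).asscalar() #only this value crosses back to the CPU
        total += data.shape[0]
    return correct / total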

I am running through the 'Multilayer perceptrons in gluon' MNIST tutorial on http://gluon.mxnet.io/chapter03_deep-neural-networks/mlp-gluon.html

(same code except setting context to gpu(0), sequential model used)

I am on Windows 10, using Python 3 (Anaconda). I installed CUDA 9.0 and cuDNN v7.0.5 for CUDA 9.0, then installed mxnet_cu90 from pip.

I set the data and model contexts to gpu(0), but my GTX 1080 hovers around 1-4% usage (whether or not the script is running), whilst my 8 Xeon cores ramp up to around 50-60% through the epochs. There was no difference in training time regardless of context. When I print the params after training, they are reported as NDArrays on gpu(0), so it definitely thinks it is using the GPU.

EDIT: Replicated on my laptop at home (GPU: GTX 980M, CPU: i7-4710HQ). In this case the GPU was utilized: the 980M went from 0% to 12% use each epoch. However, the CPU was also at >40% load, and training with the GPU context was actually slower than on the CPU.

I am starting to think that because this is a simple problem (MNIST with a small ANN), the GPU is just not challenged. Maybe I will see a far greater impact from the GPU when training a CNN.
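
For reference, a minimal Gluon CNN along those lines might look like the sketch below. This is not from the tutorial; the layer sizes and channel counts are arbitrary choices for illustration, and it reuses the same ctx idea as the script further down:

#SKETCH ONLY: a small CNN to compare the same gpu(0)/cpu() contexts
import mxnet as mx
from mxnet import gluon

ctx = mx.gpu(0) #swap for mx.cpu() to compare

cnn = gluon.nn.Sequential()
with cnn.name_scope():
    cnn.add(gluon.nn.Conv2D(channels = 32, kernel_size = 3, activation = "relu"))
    cnn.add(gluon.nn.MaxPool2D(pool_size = 2, strides = 2))
    cnn.add(gluon.nn.Conv2D(channels = 64, kernel_size = 3, activation = "relu"))
    cnn.add(gluon.nn.MaxPool2D(pool_size = 2, strides = 2))
    cnn.add(gluon.nn.Flatten())
    cnn.add(gluon.nn.Dense(128, activation = "relu"))
    cnn.add(gluon.nn.Dense(10))
cnn.collect_params().initialize(mx.init.Normal(sigma = 0.01), ctx = ctx)

#gluon's MNIST batches arrive as (N, 28, 28, 1); Conv2D expects NCHW, so inside
#the training loop reshape instead of flattening to 784:
#    data = data.as_in_context(ctx).reshape((-1, 1, 28, 28))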

I am still a little confused though, as I never had these issues when I used TensorFlow, where utilizing the GPU almost always outperformed my CPU.

Any help appreciated, Thanks, T.

EDIT: CODE AS REQUESTED:

#MULTILAYER PERCEPTRONS IN GLUON (MNIST)
#MODIFIED FROM: http://gluon.mxnet.io/chapter03_deep-neural-networks/mlp-gluon.html

#IMPORT REQUIRED PACKAGES
import numpy as np
import mxnet as mx
from mxnet import nd, autograd, gluon
import datetime #for comparing training times

#SET THE CONTEXTS (GPU/CPU)
ctx = mx.gpu(0) #note: the original tutorial sets separate context variables for the data and the model. The data_ctx was never used, so I submitted an issue on GitHub and use a single ctx here
#ctx = mx.cpu()

#PREDEFINE SOME USEFUL NUMBERS
batch_size = 64
num_inputs = 784
num_outputs = 10 #ten hand written digits [0-9]
num_examples = 60000

#LOAD IN THE MNIST DATASET
def transform(data, label):
    return data.astype(np.float32)/255, label.astype(np.float32)
train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train = True, transform = transform), batch_size, shuffle = True)
test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train = False, transform = transform), batch_size, shuffle = False)

#MAKE SEQUENTIAL MODEL

num_hidden = 64
net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(num_hidden, activation = "relu"))
    net.add(gluon.nn.Dense(num_hidden, activation = "relu"))
    net.add(gluon.nn.Dense(num_outputs))

net.collect_params().initialize(mx.init.Normal(sigma = 0.01), ctx = ctx)

#SETUP THE FUNCTIONS FOR TRAINING

softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss() #LOSS
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01}) #OPTIMIZER

#DEFINE A LOOP TO TEST THE ACCURACY OF THE MODEL ON A TEST SET
def evaluate_accuracy(data_iterator, net):
    acc = mx.metric.Accuracy()
    for i, (data, label) in enumerate(data_iterator):
        data = data.as_in_context(ctx).reshape((-1,784))
        label = label.as_in_context(ctx)
        output = net(data)
        predictions = nd.argmax(output, axis = 1)
        acc.update(preds = predictions, labels = label)
    return acc.get()[1] #get the accuracy value from the mxnet accuracy metric

#TRAINING LOOP
epochs  = 10
smoothing_constant = 0.01
start_time = datetime.datetime.now()

for e in range(epochs):
    cumulative_loss = 0
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(ctx).reshape((-1, 784))
        label = label.as_in_context(ctx)
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(data.shape[0])
        cumulative_loss += nd.sum(loss).asscalar()
    test_accuracy = evaluate_accuracy(test_data, net)
    train_accuracy = evaluate_accuracy(train_data, net)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (e, cumulative_loss/num_examples, train_accuracy, test_accuracy))

#I ADDED THIS TO GET THE FINAL PARAMETERS / NDARRAY CONTEXTS    
params = net.collect_params()
for param in params.values():
    print(param.name,param.data())

#I ADDED THIS TO COMPARE THE TIMING I GET WHEN SETTING THE CTX AS GPU/CPU   
end_time = datetime.datetime.now()
training_time = end_time - start_time
print("In h/m/s, total training time was: %s" % training_time)

RESULTS FOR CPU CONTEXT: [screenshot: cmd output for params and total training time (cpu)]

RESULTS FOR GPU CONTEXT (actually took longer): [screenshot: cmd output for params and total training time (gpu)]


Solution

  • There are a few things that are affecting your performance.

    1. Your training is limited by the DataLoader. Use num_workers to increase the number of worker processes fetching and pre-processing the data into NDArrays, so that your GPU isn't starved. For example: train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, transform=transform), batch_size, shuffle=True, num_workers=4)

    2. The built-in metrics in MXNet are inefficient at the moment, especially when the batch size is quite small. If you profile the training loop (e.g. with a simple time.time(), as in the sketch after this list), you'll notice that the majority of the time is spent in the accuracy calculation rather than in training. However, this is typically not an issue in a real DL training session, because the training data is usually much larger than the validation data and you don't normally end up calculating both training and validation accuracy the way it is shown in the tutorial.

    Overall though, you're not going to get a huge bump in GPU utilization because the tutorial network and the dataset are very simple.
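
To make both points concrete, here is a rough sketch (not part of the original answer) that adds num_workers to the DataLoader and times one training pass against one accuracy pass. It assumes the names from the question's script (ctx, transform, batch_size, net, softmax_cross_entropy, trainer, evaluate_accuracy, test_data) are already defined:

#SKETCH ONLY: check where the time goes for one epoch
import time
import mxnet as mx
from mxnet import nd, autograd

#1. more worker processes so data fetching/pre-processing doesn't starve the GPU
train_data = mx.gluon.data.DataLoader(
    mx.gluon.data.vision.MNIST(train = True, transform = transform),
    batch_size, shuffle = True, num_workers = 4)

#2. time the training pass and the accuracy pass separately
tic = time.time()
for data, label in train_data:
    data = data.as_in_context(ctx).reshape((-1, 784))
    label = label.as_in_context(ctx)
    with autograd.record():
        loss = softmax_cross_entropy(net(data), label)
    loss.backward()
    trainer.step(data.shape[0])
nd.waitall() #flush pending (asynchronous) GPU work before reading the clock
train_time = time.time() - tic

tic = time.time()
test_accuracy = evaluate_accuracy(test_data, net)
eval_time = time.time() - tic

print("training pass: %.1fs, accuracy pass: %.1fs" % (train_time, eval_time))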