I am currently using fastai to train computer vision models.
My development environment looks like this. The machine has:
- CPU: 16 cores
- RAM: 64 GB
- GPU: NVIDIA A100
- SSD: 200 GB
I develop in a JupyterLab container on a single-node Docker Swarm cluster. The JupyterLab instance is built on this image: nvcr.io/nvidia/pytorch:23.01-py3
When I launch a training run, the GPU is not used at 100%: utilization sits at roughly 20%, while GPU memory is fully used, consistent with my batch_size. Here is a screenshot:
I ran the same training in plain PyTorch, with the same model, the same data, and similar hyperparameters, and there it uses 100% of the GPU.
I have tried installing different versions of PyTorch, fastai, and CUDA, but nothing helps: with fastai, my GPU utilization is always capped at around 20%.
Do you have any leads that could help me find a solution, please?
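In case it helps narrow things down, here is a minimal sketch (dummy tensors and hypothetical sizes, not my real dataset) of one way to measure whether the data pipeline is starving the GPU: time how long fetching each batch takes compared to the forward/backward pass. If fetch time dominates, low GPU utilization is an input bottleneck rather than a compute problem.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the real image data (hypothetical sizes).
ds = TensorDataset(torch.randn(512, 64), torch.randint(0, 10, (512,)))
dl = DataLoader(ds, batch_size=32, num_workers=0)

model = torch.nn.Linear(64, 10)
loss_fn = torch.nn.CrossEntropyLoss()

fetch_time = compute_time = 0.0
it = iter(dl)
while True:
    t0 = time.perf_counter()
    try:
        xb, yb = next(it)          # time spent waiting on the data pipeline
    except StopIteration:
        break
    t1 = time.perf_counter()
    loss = loss_fn(model(xb), yb)  # forward + backward = the actual model work
    loss.backward()
    t2 = time.perf_counter()
    fetch_time += t1 - t0
    compute_time += t2 - t1

print(f'fetch {fetch_time:.4f}s vs compute {compute_time:.4f}s')
```

On a real GPU run you would also want a `torch.cuda.synchronize()` before each timestamp, since CUDA kernels are launched asynchronously.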
Thank you for your feedback.
After several more hours of investigation, I found what was slowing down my GPU: the ActivationStats callback.
Here is the code of my learner:
learn = vision_learner(
    dls,
    'resnet18',
    metrics=[accuracy, error_rate],
    cbs=[
        CSVLogger(fname='PTO_ETIQUETTE.csv'),
        EarlyStoppingCallback(monitor='valid_loss', min_delta=0.3, patience=10),
        ActivationStats(with_hist=True),
    ],
    pretrained=True,
)
I don't understand why this callback degrades GPU performance so much. Can anyone explain?
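For what it's worth, my understanding is that ActivationStats registers forward hooks on the model's layers and, on every batch, computes summary statistics of each layer's activations (plus a histogram when with_hist=True). Calls like .item() force a GPU-to-CPU synchronization per stat, per layer, per batch, which stalls the GPU between kernels. A toy pure-PyTorch sketch of that mechanism (hypothetical model and sizes, CPU-only so it runs anywhere; this is not fastai's actual implementation):

```python
import torch
import torch.nn as nn

# Toy model standing in for resnet18 (hypothetical layer sizes).
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

stats = []  # per-batch, per-layer records, like ActivationStats keeps

def record(module, inp, out):
    # Each of these ops adds extra work on every batch for every hooked
    # layer; .item() in particular is a device->host sync point when the
    # tensor lives on the GPU, so the GPU sits idle while stats are read.
    stats.append({
        'mean': out.mean().item(),
        'std': out.std().item(),
        'hist': torch.histc(out.float(), bins=40),  # the with_hist=True part
    })

hooks = [m.register_forward_hook(record)
         for m in model if isinstance(m, nn.ReLU)]

x = torch.randn(32, 64)  # one "batch": the hook fires and records stats
model(x)

for h in hooks:
    h.remove()

print(len(stats))  # one record per hooked layer per forward pass
```

So with many layers and small batches, the bookkeeping can easily dominate the actual forward/backward work, which would match the ~20% utilization. Dropping ActivationStats (or at least with_hist=True) from cbs when you don't need the diagnostics avoids the overhead.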