Tags: memory-management, julia, gpu, google-compute-engine, flux.jl

Stepwise decrease in GPU utilization followed by an out-of-memory error


I am running a 3D image segmentation deep learning training pipeline on a GCloud VM and am noticing a stepwise decrease in GPU utilization after about 25 epochs, followed by an out-of-memory error after 32 epochs. Since such a training pipeline is essentially the same loop over the data, epoch after epoch, and since none of the other main metrics show a comparable change in pattern, I don't understand why the first epochs run fine and the problem then appears so suddenly.

Could this be some kind of memory leak on the GPU? Could GCloud apply some kind of throttling based on the GPU temperature?
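One way to test the memory-leak hypothesis would be to log free device memory once per epoch with CUDA.jl. The sketch below assumes CUDA.jl is installed; the `log_gpu_memory` helper is hypothetical and not part of the original pipeline:

```julia
using CUDA, Printf

# Hypothetical per-epoch check: log how much device memory is still free,
# to see whether usage creeps upward across epochs before the OOM hits.
function log_gpu_memory(epoch)
    free  = CUDA.available_memory() / 2^30   # GiB currently free on the device
    total = CUDA.total_memory() / 2^30       # total device memory in GiB
    @printf("epoch %d: %.2f / %.2f GiB free\n", epoch, free, total)
end

# If free memory shrinks steadily, forcing a GC pass and returning cached
# blocks to the driver after each epoch can rule out allocator caching:
# GC.gc(); CUDA.reclaim()
```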

[Figure: GPU utilization over training epochs, showing the stepwise decrease]

Some context info:

Some things I've tried:


Solution

  • The problem was resolved by changing my optimiser from Flux.Nesterov to Optimisers.Nesterov, as suggested here. Apparently the Flux optimisers accumulate some kind of state, whereas the ones from Optimisers.jl do not; see the sketch below.
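For reference, a minimal sketch of what that change might look like with the explicit Optimisers.jl API. The function name `train_model!` and the `loss`, `train_loader`, and `n_epochs` names are placeholders, not the actual pipeline code:

```julia
using Flux, Optimisers

# Before: implicit-style Flux optimiser, which keeps its state internally.
# opt = Flux.Nesterov(1e-3)
# Flux.train!(loss, Flux.params(model), train_data, opt)

# After: explicit Optimisers.jl rule. The state tree is created once and
# threaded through update!, so nothing accumulates inside the rule itself.
function train_model!(model, train_loader, loss; n_epochs = 40, η = 1e-3)
    opt_state = Optimisers.setup(Optimisers.Nesterov(η, 0.9), model)
    for epoch in 1:n_epochs
        for (x, y) in train_loader
            grads = Flux.gradient(m -> loss(m(x), y), model)[1]
            opt_state, model = Optimisers.update!(opt_state, model, grads)
        end
    end
    return model
end
```

Newer Flux versions also expose this explicit workflow via `Flux.setup`, which wraps `Optimisers.setup`.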