Tags: pytorch, unified-memory

Is there a PyTorch with CUDA Unified GPU-CPU Memory fork?


Training a DNN model can be a pain when a batch with a single image already takes 15 GB. Speed is not so important for me, but fitting bigger batches (and models) is. So I wonder: is there a PyTorch fork with CUDA Unified GPU-CPU Memory, or something like that, to fit giant models? (Having 16 GB of RAM per GPU but 250 GB on the CPU side, it seems quite reasonable.)


Solution

  • If you do not care about the time it takes but need large batches, you can use a slower approach. Say you need a batch of 128 samples but your GPU memory only fits 8 samples. You can create smaller batches of 8 samples and then average their gradients.

    For each small batch of 8 samples that you evaluate, you keep the .grad of each parameter in your CPU memory. You keep a list of grads for each of your model's parameters. After you have gathered the grads for 16 batches of 8 samples (128 samples in total), you can average the gradients of each parameter and put the result back into the .grad attribute of each parameter.

    You can then call .step() on your optimizer. Since the loss of each small batch is already averaged over its 8 samples, this should yield the same results as if you were using a large batch of 128 samples (see the sketch below).
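A minimal sketch of this gradient-accumulation approach, assuming a placeholder model, loss, and optimizer (the real DNN, data, and hyperparameters would replace them):

```python
import torch
import torch.nn as nn

# Sketch: accumulate gradients from 16 micro-batches of 8 samples on the CPU
# to emulate a single batch of 128. All names here are illustrative.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(512, 10).to(device)      # stand-in for the real DNN
criterion = nn.CrossEntropyLoss()          # loss is a mean over the batch
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

micro_batch = 8
accum_steps = 16                           # 16 * 8 = 128 samples in total

# One accumulated grad tensor per parameter, kept in CPU memory.
cpu_grads = [torch.zeros_like(p, device="cpu") for p in model.parameters()]

optimizer.zero_grad()
for step in range(accum_steps):
    x = torch.randn(micro_batch, 512, device=device)          # fake inputs
    y = torch.randint(0, 10, (micro_batch,), device=device)   # fake labels

    loss = criterion(model(x), y)
    loss.backward()

    # Move this micro-batch's grads to the CPU and clear them on the GPU.
    for p, g in zip(model.parameters(), cpu_grads):
        g += p.grad.detach().cpu()
    optimizer.zero_grad()

# Average the accumulated grads and write them back into .grad.
for p, g in zip(model.parameters(), cpu_grads):
    p.grad = (g / accum_steps).to(device)

optimizer.step()
```

Because each micro-batch loss is already a mean over its 8 samples, averaging the 16 per-micro-batch gradients reproduces the gradient of a mean loss over all 128 samples. Note that layers whose behavior depends on the batch size (e.g. BatchNorm) still see batches of 8, so for such models the result can differ slightly from a true 128-sample batch.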