tensorflow ray ray-tune

Out of memory at every second trial using Ray Tune


I am tuning hyperparameters with Ray Tune. The model is built with TensorFlow and occupies a large part of the available GPU memory. I noticed that every second trial fails with an out-of-memory (OOM) error. It looks like the memory is being freed: the GPU memory usage graph shows the moment between two consecutive trials, and the OOM error occurred between them. On smaller models I do not encounter this error, and the graph looks the same.

How can I deal with this out-of-memory error in every second trial?

Memory usage graph


Solution

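  • Ray Tune has a utility that helps avoid this, tune.utils.wait_for_gpu. Called at the start of a trial, it blocks until the GPU memory held by the previous trial has been released. It relies on the GPUtil package being installed.

    https://docs.ray.io/en/master/tune/api_docs/trainable.html#ray.tune.utils.wait_for_gpu

    from ray import tune

    def tune_func(config):
        # Block until the previous trial's GPU memory has been freed.
        tune.utils.wait_for_gpu()
        train()  # placeholder for your own training code

    # Ray Tune documents its resource keys in lowercase ("cpu", "gpu").
    tune.run(tune_func, resources_per_trial={"gpu": 1}, num_samples=10)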
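
To make the idea behind a wait-for-GPU utility more concrete, here is a minimal sketch of the polling pattern: check memory utilization, sleep, and retry until it drops below a threshold. This is not Ray's implementation. The function name `wait_until_memory_free` and the `read_utilization` callback are hypothetical, and the readings are simulated so the snippet runs without a GPU.

```python
import time
from typing import Callable


def wait_until_memory_free(
    read_utilization: Callable[[], float],
    target_util: float = 0.01,
    retry: int = 20,
    delay_s: float = 5.0,
) -> bool:
    """Poll until the reported memory utilization is at or below target_util.

    read_utilization is a hypothetical callback that returns the fraction
    of GPU memory currently in use (0.0 to 1.0). In a real setup it could
    wrap a GPU monitoring library; here it is injected for illustration.
    """
    for _ in range(retry):
        if read_utilization() <= target_util:
            return True
        # Memory is still held (for example by the previous trial); wait.
        time.sleep(delay_s)
    raise RuntimeError("GPU memory was not freed in time.")


# Simulated readings: memory is released gradually after a trial ends.
readings = iter([0.92, 0.40, 0.005])
result = wait_until_memory_free(lambda: next(readings), delay_s=0.0)
```

Calling a check like this at the top of each trial delays the start of training until the previous trial's memory is actually gone, which is why the failure pattern of every second trial can disappear.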