memory-management, memory-leaks, tensorflow2.0, ray, rllib

RAM Usage Keeps Going Up While Training an RL Network Using RLLib and TensorFlow


I had been using older versions of Ray and TensorFlow, but recently transitioned to the following up-to-date versions on an Ubuntu 20.04 Linux setup.

ray==2.0.0
tensorflow==2.10.0
cuDNN==8.1
CUDA==11.2

While training a single-agent network, I have been running into excessive RAM utilization: as the TensorBoard screenshot of ram_util_percent below shows, RAM usage keeps climbing and my training session keeps crashing. This behavior was not present with the earlier versions of ray and tensorflow.

[TensorBoard screenshot of ram_util_percent rising during training]

Below are the things I have tried so far:

  1. Set the argument reuse_actors = True in ray.tune.run()
  2. Limited object_store_memory to a certain amount, currently 0.25 GB (a sketch of where these first two settings go follows this list)
  3. Following this and this, set the core file size to unlimited and increased the open-files limit
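
For reference, here is roughly where the first two settings go; this is only a minimal sketch, where the trainable name, environment, and stop criterion are placeholders rather than my actual setup:

import ray
from ray import tune

# Cap the Ray object store at roughly 0.25 GB (value is given in bytes)
ray.init(object_store_memory=int(0.25 * 1024**3))

tune.run(
    "PPO",                                  # placeholder trainable
    config={"env": "CartPole-v1",           # placeholder environment
            "framework": "tf2"},
    reuse_actors=True,                      # reuse actor processes between trials
    stop={"timesteps_total": int(5e6)},
)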

None of these methods has helped so far. As a temporary workaround, I call Python's garbage collector to free up unused memory whenever RAM usage reaches 80%. I am not sure this will keep mitigating the issue at higher numbers of training time steps; my guess is it will not.

import gc
import psutil

def collectRemoveMemoryGarbage(self, percThre=80.0):
    """
    :param percThre: RAM usage percentage threshold that triggers a collection
    :return: None
    """
    # Force a garbage-collection pass only when system RAM usage crosses the threshold
    if psutil.virtual_memory().percent >= percThre:
        _ = gc.collect()

Does anyone know a better approach to this problem? I know this is a well-discussed problem on the ray GitHub issues page. It might turn out to be an annoying bug in ray or tensorflow, but I am looking for feedback from others who are well-versed in this area.


Solution

  • Increasing the checkpoint_freq argument within ray.tune.run() let me reach 5e6 time steps without any crash from running out of memory; previously it was 10, now it is 50 (see the sketch at the end of this answer).

    It seems that checkpointing less frequently does the trick.

    [TensorBoard screenshot: ram_util_percent after increasing checkpoint_freq]

    I will try out higher numbers of time steps next.
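
    For reference, a minimal sketch of where checkpoint_freq sits in the call; the trainable name and stop criterion are placeholders, not my exact setup:

    from ray import tune

    tune.run(
        "PPO",                               # placeholder trainable
        stop={"timesteps_total": int(5e6)},  # target time steps
        checkpoint_freq=50,                  # checkpoint every 50 training iterations (was 10)
        checkpoint_at_end=True,              # optional: still keep a final checkpoint
        reuse_actors=True,                   # as tried in the question
    )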