I had been using older versions of Ray and TensorFlow, but recently upgraded to the following versions on a Linux Ubuntu 20.04 setup:
ray==2.0.0
tensorflow==2.10.0
cuDNN==8.1
CUDA==11.2
While training a single-agent network, I have been running into issues with RAM utilization climbing until memory is exhausted. See the TensorBoard screenshot of ram_util_percent below. My training session keeps crashing, and this behavior was not present with earlier versions of ray and tensorflow.
Below are the things I have tried so far:

- Setting reuse_actors=True in ray.tune.run()
- Limiting object_store_memory to a certain amount, currently 0.25 GB

None of these methods has helped so far. As a temporary workaround, I am calling Python's garbage collector to free up unused memory whenever memory usage reaches 80%. I am not sure whether this will keep mitigating the issue as I train for more time steps; my guess is no.
import gc
import psutil

def collectRemoveMemoryGarbage(self, percThre=80.0):
    """
    :param percThre: RAM usage percentage threshold that triggers collection
    :return: None
    """
    # Run a full garbage-collection pass once system-wide RAM usage crosses the threshold
    if psutil.virtual_memory().percent >= percThre:
        _ = gc.collect()
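For reference, this is roughly how the two settings above are wired up in my script; the trainable name, stop condition, and exact byte count below are placeholders rather than my actual config:

import ray
from ray import tune

# Cap the shared object store at roughly 0.25 GB (ray.init expects bytes)
ray.init(object_store_memory=int(0.25 * 1024 ** 3))

tune.run(
    "PPO",                                # placeholder trainable, not my actual algorithm
    stop={"timesteps_total": 1_000_000},  # placeholder stop condition
    reuse_actors=True,                    # reuse worker actors instead of recreating them
)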
Does anyone know a better approach to this problem? I know this is a well-discussed problem on the Ray GitHub issues page. It might turn out to be an annoying bug in ray or tensorflow, but I am looking for feedback from others who are well-versed in this area.
Increasing the checkpoint_freq argument in ray.tune.run() helped me reach 5e6 time steps without any crash from running out of memory; previously it was 10, now it is 50. It seems that checkpointing less frequently does the trick. I will try out a higher number of time steps next.
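For anyone else hitting this, the change amounts to something like the following; the trainable and stop condition are placeholders for my actual setup:

from ray import tune

tune.run(
    "PPO",                               # placeholder trainable
    stop={"timesteps_total": int(5e6)},  # placeholder stop condition
    checkpoint_freq=50,                  # was 10; checkpoint every 50 training iterations
)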