I found some similar topics, but none of them had a working solution. Since I have more information to provide, I am opening this issue.
My PyTorch script frequently gets stuck on a training server.
htop shows only one green CPU bar while the other active cores are almost 100% red. According to htop's built-in help (F1), red means kernel time.
Whenever these 100% red CPU bars appear, training gets stuck and GPU utilization drops to 0%. The weird thing is that this only happens on two of the servers I use; it never happens on my PC (less powerful) and never on another powerful server.
The strace command shows that, when the problem occurs, the process makes many calls like:
futex(0x55bbb0e82db0, FUTEX_WAKE_PRIVATE, 1) = 0
Can anyone explain what the problem is and how to avoid it? Or is there further information I should provide?
I solved the problem and found the likely cause.
High CPU usage means the CPU is busy, so this is not a disk I/O limitation (that would show up as idle or iowait time instead).
Low GPU usage means the GPU is not being fed data fast enough.
Together, these point to RAM (memory management) as the most likely bottleneck in my case.
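This reasoning can also be confirmed programmatically; below is a minimal sketch using the third-party psutil package (my own illustration, not part of the original diagnosis). A high system (kernel) share with near-zero iowait matches the red htop bars and rules out disk I/O:

```python
import psutil  # third-party: pip install psutil

# Sample the system-wide CPU time split once per second for 10 seconds.
# 'system' is kernel time (red in htop); 'iowait' would be high if the
# processes were stalled on disk I/O instead (fields shown are Linux-specific).
for _ in range(10):
    t = psutil.cpu_times_percent(interval=1.0)
    print(f"user={t.user:5.1f}%  system={t.system:5.1f}%  iowait={t.iowait:5.1f}%")
```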
As mentioned in the GitHub issue linked below, when multiple processes access the same Python object, each access updates the object's reference count. In fork mode the parent's memory pages are shared copy-on-write, so this ref-count write forces the kernel to copy the page, and the constant page allocation degrades system performance (which is why the time shows up as kernel time).
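Here is a minimal, Linux-only sketch of the effect (my own illustration, not code from the issue): the forked child only reads the list, yet its private memory grows, because every access writes the objects' ref-counts and triggers copy-on-write.

```python
import os

def private_dirty_mib() -> float:
    """Private_Dirty pages of this process in MiB (Linux 4.14+)."""
    with open("/proc/self/smaps_rollup") as f:
        for line in f:
            if line.startswith("Private_Dirty:"):
                return int(line.split()[1]) / 1024  # value is in kB
    return 0.0

# Millions of small Python objects allocated in the parent.
data = [str(i) for i in range(2_000_000)]

pid = os.fork()
if pid == 0:  # child
    before = private_dirty_mib()
    for item in data:  # read-only iteration, but ref-counts are written
        pass
    after = private_dirty_mib()
    print(f"child private memory grew by ~{after - before:.0f} MiB")
    os._exit(0)
os.waitpid(pid, 0)
```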
This system behavior cannot be detected by Python memory-profiling libraries such as Memray (https://github.com/bloomberg/memray), but it might be detected by system-level memory tools such as Valgrind (https://valgrind.org/).
https://github.com/pytorch/pytorch/issues/13246#issuecomment-905703662
The final solution is to avoid accessing Python objects from the forked worker processes, e.g. by storing per-sample metadata in containers that do not hold per-item Python objects.
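For a PyTorch DataLoader with num_workers > 0, a common workaround from the linked issue is to replace Python lists with numpy arrays. A minimal sketch with illustrative names (PathsDataset and its fields are mine, not from the issue):

```python
import numpy as np
from torch.utils.data import Dataset

class PathsDataset(Dataset):
    """Stores file paths in a single contiguous numpy byte array instead
    of a Python list, so forked DataLoader workers never touch the
    ref-counts of millions of long-lived Python objects."""

    def __init__(self, paths):
        # Fixed-width 'S' dtype: one buffer, no per-item PyObjects.
        self.paths = np.array([p.encode("utf-8") for p in paths])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Indexing materializes a fresh bytes object inside the worker,
        # instead of touching an object shared with the parent process.
        path = self.paths[idx].decode("utf-8")
        # ... load and return the actual sample from `path` ...
        return path
```

The key point is that indexing the numpy array creates a new Python object on demand inside the worker, so no ref-count on a page shared with the parent is ever written.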