I've been troubleshooting a test case on an EL9 environment which behaves differently in an EL7 environment.
It's specifically:
os.cwd()
returning 'no such file or directory'.Reproduced on:
Example reproduction code:
#!/usr/bin/python3 -u
import os
import multiprocessing
import time
if __name__=="__main__":
multiprocessing.set_start_method("spawn")
count = 0
while True:
try:
os.getcwd()
pool = multiprocessing.Pool(10)
pool.close()
pool.terminate()
count += 1
except Exception as e:
print(f"Failed after {count} iterations")
print(e)
break
I'm at something of a loss to understand quite what's going on here, and quite why it fails.
It seems to be connected to pool.terminate()
as if you add even a short (0.05) sleep before that, the problem stops occurring (at least, wasn't reproducible in 'sensible' amounts of time, where the above fails in <10 iterations).
And interestingly the cwd
stays inconsistent:
build-2[myuser]:[~/python_race]$ ./simple.py
Failed after 2 iterations
[Errno 2] No such file or directory
build-2[myuser]:[~/python_race]$ ./simple.py
Traceback (most recent call last):
File "<frozen importlib._bootstrap_external>", line 1362, in _path_importer_cache
KeyError: '.'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./simple.py", line 4, in <module>
import multiprocessing
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 982, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 925, in _find_spec
File "<frozen importlib._bootstrap_external>", line 1423, in find_spec
File "<frozen importlib._bootstrap_external>", line 1392, in _get_spec
File "<frozen importlib._bootstrap_external>", line 1364, in _path_importer_cache
File "<frozen importlib._bootstrap_external>", line 1340, in _path_hooks
File "<frozen importlib._bootstrap_external>", line 1610, in path_hook_for_FileFinder
File "<frozen importlib._bootstrap_external>", line 1486, in __init__
FileNotFoundError: [Errno 2] No such file or directory
This seems to 'recover' with a chdir, or waiting for what I assume is an NFS cache expiry interval.
We've picked this up due to some of our tests failing on the 'new version' on a more simplistic:
import multiprocessing
multiprocessing.set_start_method("spawn")
while True:
multiprocessing.Pool(10)
But as mentioned - specific to NFS mounts in subdirs which we think might be because the struct dentry
behaving a bit differently on root mounts.
But I was wondering if anyone could shed some insight into what might be going wrong here? I'm at a loss to identify whether it might be fileserver, linux kernel, something within python, or... well, somewhere else entirely.
With rpcdebug enabled we appear to get:
Apr 24 11:58:47 build-2 kernel: NFS: release(lib/libgcc_s.so.1)
Apr 24 11:58:47 build-2 kernel: NFS: release(lib/libc.so.6)
Apr 24 11:58:47 build-2 kernel: NFS: release(locale/locale-archive)
Apr 24 11:58:47 build-2 kernel: NFS reply getattr: -512
Apr 24 11:58:47 build-2 kernel: nfs_revalidate_inode: (0:53/3834468196) getattr failed, error=-512
Apr 24 11:58:47 build-2 kernel: NFS: lookup(/python_race)
Apr 24 11:58:47 build-2 kernel: NFS call lookup /python_race
Apr 24 11:58:47 build-2 kernel: RPC: xs_tcp_send_request(168) = 0
Apr 24 11:58:47 build-2 kernel: NFS: release(bin/python3.11)
Edit:
Have also found it does not occur if you add a pool.join()
prior to the terminate()
This isn't really an answer-answer, as much as confirmation that it is, in fact, a kernel bug, not a python issue - it just so happens that python was an easy way to 'trigger' it.
Logged with RedHat: https://issues.redhat.com/browse/RHEL-35853
If anyone's hitting the same problem, we found that adding:
pool.join()
into the code prior to pool.terminate()
seems to stop it occurring, but give the issue is in kernel space, we're not confident that it's anything more substantive than "just" a short delay in the code, which masks the implicit race condition.
We've found that a sleep(0.05)
likewise seems to prevent it happening, presumably for similar reasons, as there's no way userspace should be able to invalidate the directory entry caching in the first place.