This is a sporadic issue, and I have not been able to find a condition that reproduces it reliably.
The gist is that an instance or the controller node will randomly fail to find files that have already been created on Amazon FSx. A sample script can be as simple as this:
import dask

fn = '/mnt/fsx/home/user/something.txt'

def run():
    # Read the same file twice from within a single task.
    with open(fn) as f:
        s1 = f.readlines()
    with open(fn) as g:  # <-- this open() can fail to find the file
        s2 = g.readlines()
    return len(s1) + len(s2)

# Create the file once, before any task is scheduled.
with open(fn, 'w') as f:
    f.write('blah blah blah')

ret = [dask.delayed(run)() for _ in range(2000)]
result = dask.compute(ret)
The second open(...) in run() can fail with a plain Python FileNotFoundError, even though the first open(...) on the same path has just succeeded.
I could not find any information on why this happens or how to mitigate it. I did consider keeping the file on S3 so that there are built-in retries around the file access, but that brings its own load and cost issues.
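For reference, the kind of retry I had in mind can also be done around a plain open(). This is only a minimal sketch: open_with_retry, attempts, and delay are names and values I made up for illustration, and retrying only hides a transient lookup failure rather than explaining it.

```python
import time


def open_with_retry(path, mode="r", attempts=5, delay=0.1):
    """Open `path`, retrying on FileNotFoundError with exponential backoff.

    Hypothetical helper: the attempt count and delay are arbitrary defaults.
    """
    for attempt in range(attempts):
        try:
            return open(path, mode)
        except FileNotFoundError:
            # Give up and re-raise once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            # Wait delay, 2*delay, 4*delay, ... between attempts.
            time.sleep(delay * 2 ** attempt)
```

In run(), each `with open(fn) as f:` would become `with open_with_retry(fn) as f:`. A file that really does not exist still raises FileNotFoundError once the attempts are exhausted.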
Unfortunately, I misdiagnosed the problem: this was not an FSx issue.
Somewhere in the paths there were symlinks that pointed to a different shared drive. That was the shared drive that failed us. FSx was resilient throughout.
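In hindsight, a quick check along these lines would probably have exposed the symlinks much sooner. It is a rough sketch: find_symlinks is a hypothetical helper name, and it only inspects the components of the path as written, so a symlink hidden behind another symlink's target is not reported.

```python
import os
from pathlib import Path


def find_symlinks(path):
    """Return (link, target) pairs for each symlinked component of `path`.

    Hypothetical helper: it walks the path from the root down to the
    final component and does not follow the links it finds.
    """
    p = Path(os.path.abspath(path))
    found = []
    for component in [*reversed(p.parents), p]:
        if component.is_symlink():
            found.append((str(component), os.readlink(component)))
    return found
```

Calling find_symlinks(fn) on the path from the script lists any links along the way, and comparing fn with os.path.realpath(fn) shows where the file actually lives once every link is resolved.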