python amazon-web-services amazon-ec2 dask amazon-fsx

AWS instance/controller node randomly unable to find files on FSx that are there


This is a sporadic issue, and I have not found a condition that reliably reproduces it.

The gist of the issue is that an instance/controller node will randomly fail to find files that have already been created on Amazon FSx. A sample script can be as simple as this:

import dask

fn = '/mnt/fsx/home/user/something.txt'

def run():
  with open(fn) as f:
    s1 = f.readlines()
  with open(fn) as g:  # <-- this open() can intermittently fail to find the file
    s2 = g.readlines()
  return len(s1) + len(s2)

with open(fn, 'w') as f:
  f.write('blah blah blah')

ret = [dask.delayed(run)() for _ in range(2000)]

result = dask.compute(ret)

The second open(...) in run() can fail with a plain Python FileNotFoundError, even though the first open(...) on the same path has just succeeded.

I could not find any information on why this happens or how to mitigate it. I considered keeping the file on S3 instead, since S3 access comes with built-in retries, but that could introduce different load and cost issues.


Solution

  • Unfortunately, the question rested on a misdiagnosis: this was not an FSx issue.

    Some of the paths contained symlinks that pointed to a different shared drive, and that drive was the one failing. FSx itself was reliable throughout.
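
    For anyone chasing a similar problem, here is a minimal sketch of how one might check whether a path that looks like it lives on FSx actually passes through a symlink to somewhere else. The helper names find_symlinks and leaves_mount are hypothetical (not from any library), and treating /mnt/fsx as the mount root is an assumption based on the path in the question.

```python
import os
from pathlib import Path


def find_symlinks(path):
    """Return (component, resolved_target) for every symlink along `path`."""
    found = []
    p = Path(os.path.abspath(path))
    # Path.parents runs leaf-to-root, so reverse it to walk from the root down.
    for component in [*reversed(p.parents), p]:
        if component.is_symlink():
            found.append((str(component), os.path.realpath(component)))
    return found


def leaves_mount(path, mount_root):
    """True if `path`, with all symlinks resolved, ends up outside `mount_root`."""
    real = os.path.realpath(path)
    root = os.path.realpath(mount_root)
    return os.path.commonpath([real, root]) != root


if __name__ == "__main__":
    # Path from the question; the mount root is an assumption.
    fn = '/mnt/fsx/home/user/something.txt'
    for link, target in find_symlinks(fn):
        print(f'{link} -> {target}')
    print('resolves outside /mnt/fsx:', leaves_mount(fn, '/mnt/fsx'))
```

    If leaves_mount reports True, the file is really being served from whatever storage the symlink targets, so that is the place to look for intermittent failures.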