cfilesystemsfortranparallel-processingembarrassingly-parallel

What is the best way to avoid overloading a parallel file-system when running embarrassingly parallel jobs?


We have a problem which is embarrassingly parallel - we run a large number of instances of a single program with a different data set for each; we do this simply by submitting the application many times to the batch queue with different parameters each time.

However with a large number of jobs, not all of them complete. It does not appear to be a problem in the queue - all of the jobs are started.

The issue appears to be that with a large number of instances of the application running, lots of jobs finish at roughly the same time and thus all try to write out their data to the parallel file-system at pretty much the same time.

The issue then seems to be that either the program is unable to write to the file-system and crashes in some manner, or just sits there waiting to write and the batch queue system kills the job after it's been sat waiting too long. (From what I have gathered on the problem, most of the jobs that fail to complete, if not all, do not leave core files)

What is the best way to schedule disk-writes to avoid this problem? I mention our program is embarrassingly parallel to highlight the fact the each process is not aware of the others - they cannot talk to each other to schedule their writes in some manner.

Although I have the source-code for the program, we'd like to solve the problem without having to modify this if possible as we don't maintain or develop it (plus most of the comments are in Italian).

I have had some thoughts on the matter:

  1. Each job write to the local (scratch) disk of the node at first. We can then run another job which checks every now and then what jobs have completed and moves the files from the local disks to the parallel file-system.
  2. Use an MPI wrapper around the program in master/slave system, where the master manages a queue of jobs and farms these off to each slave; and the slave wrapper runs the applications and catches the exception (could I do this reliably for a file-system timeout in C++, or possibly Java?), and sends a message back to the master to re-run the job

In the meantime I need to pester my supervisors for more information on the error itself - I've never run into it personally, but I haven't had to use the program for a very large number of datasets (yet).

In case it's useful: we run Solaris on our HPC system with the SGE (Sun GridEngine) batch queue system. The file-system is NFS4, and the storage servers also run Solaris. The HPC nodes and storage servers communicate over fibre channel links.


Solution

  • Most parallel file systems, particularly those at supercomputing centres, are targetted for HPC applications, rather than serial-farm type stuff. As a result, they're painstakingly optimized for bandwidth, not for IOPs (I/O operations per sec) - that is, they are aimed at big (1000+ process) jobs writing a handful of mammoth files, rather than zillions of little jobs outputting octillions of tiny little files. It is all to easy for users to run something that runs fine(ish) on their desktop and naively scale up to hundreds of simultaneous jobs to starve the system of IOPs, hanging their jobs and typically others on the same systems.

    The main thing you can do here is aggregate, aggregate, aggregate. It would be best if you could tell us where you're running so we can get more information on the system. But some tried-and-true strategies:

    1. If you are outputting many files per job, change your output strategy so that each job writes out one file which contains all the others. If you have local ramdisk, you can do something as simple as writing them to ramdisk, then tar-gzing them out to the real filesystem.
    2. Write in binary, not in ascii. Big data never goes in ascii. Binary formats are ~10x faster to write, somewhat smaller, and you can write big chunks at a time rather than a few numbers in a loop, which leads to:
    3. Big writes are better than little writes. Every IO operation is something the file system has to do. Make few, big, writes rather than looping over tiny writes.
    4. Similarly, don't write in formats which require you to seek around to write in different parts of the file at different times. Seeks are slow and useless.
    5. If you're running many jobs on a node, you can use the same ramdisk trick as above (or local disk) to tar up all the jobs' outputs and send them all out to the parallel file system at once.

    The above suggestions will benefit the I/O performance of your code everywhere, not juston parallel file systems. IO is slow everywhere, and the more you can do in memory and the fewer actual IO operations you execute, the faster it will go. Some systems may be more sensitive than others, so you may not notice it so much on your laptop, but it will help.

    Similarly, having fewer big files rather than many small files will speed up everything from directory listings to backups on your filesystem; it is good all around.