I'm setting up IO for a large-scale CFD code using the MPI library, and the file IO is starting to eat into computation time as my problems scale.
As far as I can find, the "done" thing in the modern context is heavy utilisation of collective IO operations (Performance of Parallel IO on ARCHER, whitepaper from 2015).
My problem is there appear to be three ways of calling a collective write:

- `MPI_File_write_all` (blocking)
- `MPI_File_iwrite_all` (non-blocking)
- and, somewhat speculatively, `MPI_File_iwrite` followed by a call to `MPI_File_sync` (non-blocking, then blocking?)
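For concreteness, here is a minimal sketch of the three call patterns I mean (the file handle, buffer and count are placeholder names, and the per-rank file view is assumed to have been set up elsewhere):

```c
#include <mpi.h>

/* Sketch only: fh is an open file handle with a per-rank view already set;
 * buf and count are placeholder names for the local data to be written. */
void write_variants(MPI_File fh, const double *buf, int count)
{
    MPI_Request req;

    /* 1) Blocking collective write: returns once buf may be reused locally. */
    MPI_File_write_all(fh, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    /* 2) Non-blocking collective write: completed later with MPI_Wait/MPI_Test. */
    MPI_File_iwrite_all(fh, buf, count, MPI_DOUBLE, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    /* 3) Non-blocking independent write, locally completed, then the
     *    collective MPI_File_sync. */
    MPI_File_iwrite(fh, buf, count, MPI_DOUBLE, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_File_sync(fh);
}
```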
I say speculatively because the former call is explicitly non-collective, but the latter (which to my knowledge is what actually pushes the data to storage) is collective.
My question is: are multiple `MPI_File_iwrite`s followed by an `MPI_File_sync` equivalent to an `MPI_File_write_all`, in that the file sync makes the non-collective writes effectively collective?
Edit: for clarity, I am aware that sync is a collective routine; I'm asking whether the IO that happens when sync is called is analogous to the collective IO of a write_all.
Follow-up: does an `MPI_File_iwrite_all` call require an `MPI_File_sync` call, and if it does, what is the purpose of a collective non-blocking write if it just becomes blocking down the line?
I'm focusing quite a bit on blocking vs. non-blocking here because I'm trying to fully remove all synchronisation from my code to improve CPU utilisation (i.e. processes only wait if they lack the information they need from their neighbours, as opposed to waiting for all processes to sync up), but obviously this becomes somewhat problematic when it comes to outputting.
Your question concerns three orthogonal MPI concepts: local completion of operations, process synchronization, and data consistency.
The main difference between blocking and non-blocking operations concerns the process-local state of the operation. A blocking operation completes before the blocking call returns; a non-blocking operation completes with a successful completion call (e.g. `MPI_Wait` or `MPI_Test`). Until the operation completes locally, the MPI library "owns" the buffers you pass into the function.
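As a minimal sketch of what local completion means for the non-blocking collective write (buffer and function names here are illustrative, not from the question):

```c
#include <mpi.h>

/* Sketch: overlap computation with a non-blocking collective write.
 * Until MPI_Wait returns, the MPI library still owns buf. */
void overlap_write(MPI_File fh, double *buf, int count)
{
    MPI_Request req;

    MPI_File_iwrite_all(fh, buf, count, MPI_DOUBLE, &req);

    /* ... computation that does not touch buf ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* local completion: buf may now be reused */
}
```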
Only a small subset of MPI functions imply synchronization. In particular, collective communication does not necessarily imply synchronization.
Completion of File-IO functions does not establish data consistency (or global visibility of the operation's impact).
`MPI_File_sync` establishes data consistency for file accesses. It is only necessary if data written to a file should be visible to a subsequent read from a different process. Example 14.6 in MPI-4.1 points out that a sequence equivalent to `MPI_File_sync` + `MPI_Barrier` + `MPI_File_sync` is actually necessary to establish data consistency between writing and reading from a file. The reason is that `MPI_File_sync` is collective but not synchronizing.
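A sketch of that sequence, modelled loosely on Example 14.6 (ranks, offsets and names are illustrative):

```c
#include <mpi.h>

/* Sketch: rank 0 writes, rank 1 reads the same bytes afterwards.
 * The sync + barrier + sync sequence makes the written data visible. */
void write_then_read(MPI_File fh, MPI_Comm comm)
{
    int rank;
    double value = 42.0;
    MPI_Comm_rank(comm, &rank);

    if (rank == 0)
        MPI_File_write_at(fh, 0, &value, 1, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_sync(fh);   /* collective: writer's data reaches the storage system */
    MPI_Barrier(comm);   /* orders the write before the read */
    MPI_File_sync(fh);   /* collective: reader must not use stale cached data */

    if (rank == 1)
        MPI_File_read_at(fh, 0, &value, 1, MPI_DOUBLE, MPI_STATUS_IGNORE);
}
```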
Whether you need `MPI_File_sync` at all depends on how your application accesses the file. If you need `MPI_File_sync`, you need it independent of the flavor of write call: you will need it with the collective write as well as the non-collective write functions. Using non-blocking writes, you need to locally complete (test/wait) all active file-IO operations on the file handle before you can call `MPI_File_sync`.
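A minimal sketch of that ordering, assuming a hypothetical array of pending write requests on the same file handle:

```c
#include <mpi.h>

/* Sketch: locally complete every outstanding non-blocking write on fh
 * before entering the collective MPI_File_sync. */
void complete_then_sync(MPI_File fh, MPI_Request reqs[], int nreqs)
{
    MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);  /* local completion of all writes */
    MPI_File_sync(fh);                              /* collective consistency point */
}
```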