I would like to overlap computation with I/O operations.
The application does this loop:
Finally, send the last block of data with RDMA write with immediate to notify the remote side.
My question is the following: when the remote side sees the RDMA write with immediate after polling, will all previous RDMA writes have completed (so that the entire data is available on the remote side)? If not, is it possible to use fences, or any other tool, to achieve this?
Yes, when the remote side sees the completion of the RDMA Write with Immediate, all previous RDMA Write operations on the same QP are guaranteed to have completed. The IB spec says: "A responder shall execute ... RDMA WRITE requests ... in the message order in which they are received". All previous RDMA Writes will therefore be fully executed before the one with immediate data is executed.
And just to be really precise, even for the last RDMA Write, the spec says "The Immediate data is not written ... is passed to the client after the last RDMA WRITE packet is successfully processed." So the completion with immediate data will not be generated until the very last RDMA Write is fully executed as well.
A fence is only required in other situations, for example when the response to an earlier RDMA Read operation might be affected by an Atomic operation posted later.