infiniband, rdma

Multithreaded use of a single QP vs multiple QPs to improve throughput


I am using RDMA writes in my application and want to improve throughput.

Currently, I have a single thread using my queue pair. I was wondering which of these is the more standard approach (and what the advantages of each are):

  1. Creating more connections to the remote node (i.e. multiple queue pairs) and load-balancing my traffic across them
  2. Using multiple threads to call ibv_post_send on the single QP?

Thank you!


Solution

  • All libibverbs APIs are thread-safe, so having multiple threads post to a single QP is not a safety issue. That said, the concurrency has to be handled somewhere in the stack, and that synchronization cost may outweigh the benefit of the extra threads.
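
    For concreteness, here is a minimal sketch of option 2, several threads sharing one QP. It assumes the QP, a registered MR, and the peer's remote address/rkey were exchanged during connection setup (not shown); post_ctx and its fields are illustrative names, not part of the verbs API.

    ```c
    /* Option 2 in miniature: several threads posting RDMA writes to ONE
     * shared QP. libibverbs serializes the doorbell internally, so no
     * extra locking is needed for correctness -- that internal lock is
     * exactly the contention point discussed above. */
    #include <infiniband/verbs.h>
    #include <stdint.h>

    struct post_ctx {                /* illustrative, not a verbs type */
        struct ibv_qp *qp;           /* shared QP, already in RTS */
        struct ibv_mr *mr;           /* registered local buffer */
        uint64_t       remote_addr;  /* peer address, from your CM exchange */
        uint32_t       rkey;
        char          *buf;
        uint32_t       len;
    };

    static void *post_writes(void *arg)   /* pthread start routine */
    {
        struct post_ctx *c = arg;
        struct ibv_sge sge = {
            .addr   = (uintptr_t)c->buf,
            .length = c->len,
            .lkey   = c->mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = (uintptr_t)c,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED,
            .wr.rdma.remote_addr = c->remote_addr,
            .wr.rdma.rkey        = c->rkey,
        };
        struct ibv_send_wr *bad_wr;
        /* Safe to call concurrently from many threads on the same QP. */
        int rc = ibv_post_send(c->qp, &wr, &bad_wr);
        (void)rc;  /* real code would handle the error and bad_wr */
        return NULL;
    }
    ```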

    In general, having a QP per core should perform better. Multiple QPs can also extract parallelism within the NIC (not just on the CPU). It's hard to make a blanket statement across NICs and drivers, though, since QPs also consume NIC SRAM and the amount available varies. That should only be a concern with an extremely large number of QPs, not with 1 QP/core or some number in that range.
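
    A sketch of that QP-per-core layout might look like the following. create_connected_qp() and worker_loop() are placeholders for your own connection setup (ibv_create_qp plus the usual INIT/RTR/RTS transitions, or rdma_cm) and your per-thread send/poll loop.

    ```c
    /* One QP (and CQ) per worker thread: each thread posts only to its
     * own QP, so no lock is ever contended, and the NIC can schedule
     * the QPs in parallel. */
    #include <infiniband/verbs.h>
    #include <pthread.h>

    #define NUM_WORKERS 4            /* e.g. one per core dedicated to I/O */

    struct worker {
        pthread_t      tid;
        struct ibv_cq *cq;           /* per-QP CQ: completions aren't shared either */
        struct ibv_qp *qp;           /* private to this thread after setup */
    };

    struct ibv_qp *create_connected_qp(struct ibv_cq *cq);  /* placeholder */
    void *worker_loop(void *arg);                           /* placeholder */

    static struct worker workers[NUM_WORKERS];

    static void start_workers(struct ibv_context *ctx)
    {
        for (int i = 0; i < NUM_WORKERS; i++) {
            workers[i].cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);
            workers[i].qp = create_connected_qp(workers[i].cq);
            pthread_create(&workers[i].tid, NULL, worker_loop, &workers[i]);
        }
    }
    ```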

    There are other things you can consider to improve your application throughput:

    1. Reconsider your application design. Larger messages are much more efficient than smaller ones if you want to approach line rate. Can you batch the data you're sending into larger buffers? (A batching sketch follows this list.)

    2. If the communication thread also does some compute for each message, those are cycles taken away from communication. Can you move the compute into its own thread? The answer is not always yes - if the compute per message is small enough, the cost of inter-thread synchronization can exceed the benefit of offloading it (see the hand-off sketch after this list).
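
    On point 1, one way to batch without extra copies is a gather list: several small pieces go out as a single RDMA write, paying one work request and one doorbell instead of several. This sketch assumes every piece lies inside the one registered MR and that 4 does not exceed the QP's max_send_sge cap; post_batched_write is an illustrative name, not a verbs call.

    ```c
    #include <infiniband/verbs.h>
    #include <stdint.h>

    static int post_batched_write(struct ibv_qp *qp, struct ibv_mr *mr,
                                  char *piece[4], uint32_t piece_len[4],
                                  uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sges[4];
        for (int i = 0; i < 4; i++) {
            sges[i].addr   = (uintptr_t)piece[i];
            sges[i].length = piece_len[i];
            sges[i].lkey   = mr->lkey;
        }
        struct ibv_send_wr wr = {
            .sg_list    = sges,
            .num_sge    = 4,                 /* four pieces, one post */
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED,
            .wr.rdma.remote_addr = remote_addr,  /* pieces land contiguously here */
            .wr.rdma.rkey        = rkey,
        };
        struct ibv_send_wr *bad_wr;
        return ibv_post_send(qp, &wr, &bad_wr);
    }
    ```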
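
    On point 2, the hand-off between a compute thread and a dedicated communication thread can be as small as a mutex-protected ring; the mutex here is exactly the inter-thread synchronization cost mentioned above, so it only pays off when the per-message compute dwarfs it. Names are illustrative, and the ring is assumed never to fill (bound it in real code).

    ```c
    /* Compute thread produces finished buffers; the comm thread consumes
     * them and does nothing but posting writes and polling the CQ. */
    #include <pthread.h>

    #define RING 64                      /* power of two for the & below */

    struct handoff {
        void           *slots[RING];
        unsigned        head, tail;
        pthread_mutex_t mu;
        pthread_cond_t  nonempty;
    };

    static struct handoff h = {
        .mu       = PTHREAD_MUTEX_INITIALIZER,
        .nonempty = PTHREAD_COND_INITIALIZER,
    };

    static void produce(void *buf)       /* called by the compute thread */
    {
        pthread_mutex_lock(&h.mu);
        h.slots[h.head++ & (RING - 1)] = buf;
        pthread_cond_signal(&h.nonempty);
        pthread_mutex_unlock(&h.mu);
    }

    static void *consume(void)           /* called by the comm thread */
    {
        pthread_mutex_lock(&h.mu);
        while (h.head == h.tail)
            pthread_cond_wait(&h.nonempty, &h.mu);
        void *buf = h.slots[h.tail++ & (RING - 1)];
        pthread_mutex_unlock(&h.mu);
        return buf;                      /* caller does the ibv_post_send */
    }
    ```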