linux multithreading process ipc dbus

What causes inter-process communication to take millions of cycles?


In the best case, inter-process communication is obviously slower than communication between threads, as threads share resources such as the heap.

However, why is inter-process communication several orders of magnitude slower? I'm not interested in whether one should be using threads or processes in certain situations, but instead in why this is the case at all.

For example, I've made several requests over D-Bus to get small amounts of data (less than 16 bytes), and each request takes at least 1 ms before a response is received. The exact duration is not important, as this is a low-performance device, but even on a GHz-class processor I wouldn't expect the operation to take on the order of millions of cycles. By contrast, the cost of reading shared heap data between threads is so small that it doesn't show up on any traces I capture.

BONUS: I've seen mentions in passing that D-Bus is the most performant way to send messages between processes on Linux, and that other alternatives perform worse. Why is this?


Solution

  • In the best case, inter-process communication is obviously slower than communication between threads, as threads share resources such as the heap.

    This is not necessarily true; in some cases it can even be the opposite.

    First of all, we need to define what "faster" means. If we measure the latency of operations, then inter-process communication (IPC) generally has a higher latency, often simply because of system calls (typically needed to synchronize processes, to share memory, to create shared files, etc.). Regarding throughput, it depends on the actual method used to communicate between processes: IPC can be equally fast (and in rare cases, depending on the exact method used, even faster).

    The thing is, there is not one method to communicate between processes but many. Files, sockets and pipes are famous examples, but IPC is not limited to those. Shared memory is also a form of IPC, and a rather efficient one. Some message-passing solutions are known to be particularly fast. A good example is the Message Passing Interface (MPI), which is widely used in distributed scientific/high-performance-computing applications (and it is still pretty efficient on a single computing node, though not always optimal); see the sketch below.
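
    As a rough illustration of what message passing looks like with MPI, here is a minimal point-to-point sketch in C (assuming an MPI implementation such as Open MPI is installed; compile with mpicc and run with mpirun -n 2). It only shows the shape of the API, not a benchmark:

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);

            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            if (rank == 0) {
                const char msg[] = "hello";   /* small payload, copied by the library */
                MPI_Send(msg, sizeof(msg), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                char buf[16];
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("rank 1 received: %s\n", buf);
            }

            MPI_Finalize();
            return 0;
        }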

    Moreover, results are pretty dependent on the target platform (including the hardware and the kernel version) and on exactly how things are done for a given method. For example, Unix domain sockets can be significantly faster than TCP sockets (in both latency and throughput), even though both are sockets, and they can sometimes be competitive with pipes. Another example: if you tune TCP sockets (e.g. disable Nagle's algorithm for local IPC, as sketched below), the latency can be significantly lower. Yet another example: the performance of one MPI implementation versus another (e.g. Intel MPI versus Open MPI) can be very different for a given use case and change drastically for another one; implementations are optimized for a set of specific use cases.
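
    For what it's worth, disabling Nagle's algorithm on an already-connected TCP socket boils down to a single setsockopt call, roughly like this (only relevant for TCP-based IPC, not for Unix domain sockets or pipes):

        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <sys/socket.h>

        /* Disable Nagle's algorithm so small messages are sent immediately
         * instead of being coalesced (can reduce local TCP latency). */
        static int disable_nagle(int fd) {
            int one = 1;
            return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
        }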

    Message passing between processes often requires a memory copy, which is not great for communicating large amounts of data. However, this method can sometimes help to reduce synchronization between processes, which can be a significant bottleneck. This is not so rare in high-performance-computing applications running on many-core systems (on which synchronizations are very expensive). It can also help to reduce NUMA effects, which are hard to avoid with multi-threaded shared-memory solutions: in that case, a memory copy can be better than accessing data on a remote NUMA node in a non-contiguous way. The pipe example below illustrates the copy-based approach.
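
    To make the copy concrete, here is a minimal pipe example in C: the payload is copied into the kernel pipe buffer by write and copied out again by read (the message and buffer sizes are purely illustrative):

        #include <stdio.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main(void) {
            int fds[2];
            if (pipe(fds) != 0)
                return 1;

            if (fork() == 0) {                  /* child: the reader */
                char buf[16] = {0};
                close(fds[1]);
                /* Copies the data out of the kernel pipe buffer. */
                if (read(fds[0], buf, sizeof(buf) - 1) > 0)
                    printf("child received: %s\n", buf);
                return 0;
            }

            close(fds[0]);                      /* parent: the writer */
            const char msg[] = "ping";
            /* Copies the data into the kernel pipe buffer. */
            if (write(fds[1], msg, sizeof(msg)) < 0)
                perror("write");
            close(fds[1]);
            wait(NULL);
            return 0;
        }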

    Regarding shared memory, this is a bit more complicated. When two processes A and B share the same memory area, A writes into it and B then reads it, the first access is significantly slower for both A and B (for each memory page). This is because the pages are not initially loaded in the translation lookaside buffer (TLB) of either A or B, but also because AFAIK Linux performs demand paging (i.e. it maps virtual pages to physical ones on first touch). This overhead is pretty significant for sequential reads/writes, especially on computing servers with a pretty high memory bandwidth (e.g. 10 GiB/s versus 100 GiB/s).

    AFAIK, subsequent accesses are fast, though (similar to two threads of the same process accessing memory shared within that process). When only a small amount of memory is shared, the data can stay in the CPU cache, so accesses are significantly faster (both latency and throughput). However, on mainstream x86-64 CPUs, if the two processes do not run on the same core, data will be read from / stored to the shared L3 cache (since the L1/L2 caches are generally not shared).

    When there is a context switch, AFAIK the TLB can be flushed (often only partially), mainly for security reasons, depending on the exact CPU used (and certainly on the Linux version too*). I think having many ready processes tends to cause more flushes than the same number of ready threads (sharing the same process). Flushing the TLB has a significant impact, not only because it needs to be filled again, but also because the L1/L2 caches can be indirectly flushed as a result (causing cache misses). For more information, see this. Note that huge pages can help to reduce such overhead for relatively large shared-memory areas (e.g. at least a few MiB); a sketch is given below.
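
    As a possible sketch of the huge-page idea (assuming transparent huge pages are enabled in the kernel), one can hint the kernel with madvise on a shared mapping so that fewer TLB entries are needed to cover it:

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/mman.h>

        int main(void) {
            /* 64 MiB anonymous shared mapping (inherited by children after fork();
             * shm_open()+mmap() would be used for unrelated processes instead). */
            size_t len = 64UL << 20;
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                return 1;

            /* Ask the kernel to back the region with transparent huge pages. */
            if (madvise(p, len, MADV_HUGEPAGE) != 0)
                perror("madvise");   /* may fail if THP is disabled on the system */

            /* ... fork() and communicate through the mapping ... */
            munmap(p, len);
            return 0;
        }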

    For example, I've made several requests over D-Bus to get small amounts of data (less than 16 bytes), and each request takes at least 1 ms before a response is received.

    I personally do not know much about D-Bus. 1 ms indeed seems pretty slow. You should first check whether this is not due to a simple timeout limit.

    The speed of D-Bus depends on several parameters. The major one is the implementation (there are multiple implementations). Regarding the default/standard implementation, AFAIK libdbus, the presence of an intermediate daemon process can slow things down. Using kdbus instead of Unix domain sockets certainly reduces the latency of the communications on kernels supporting it (it also avoids the intermediate daemon process). More information is provided on the kdbus main page. However, note that kdbus is apparently obsolete (see the comment of grawity_u1686). systemd's implementation, sd-bus, is apparently much faster than libdbus and is certainly the way to go. To quote Wikipedia:

    "In preliminary benchmarks, BMW found that the systemd's D-Bus library increased performance by 360%" (from en.wikipedia.org/wiki/D-Bus)

    Thus, I strongly advise you to check which implementation you use on your target system.
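
    If sd-bus is available, a minimal call through it looks roughly like the sketch below (in C, assuming the libsystemd headers are installed; link with -lsystemd). It simply calls the standard GetId() method of the bus driver to show the API; replace the destination/path/interface/method with those of your own service:

        #include <stdio.h>
        #include <systemd/sd-bus.h>

        int main(void) {
            sd_bus *bus = NULL;
            sd_bus_error error = SD_BUS_ERROR_NULL;
            sd_bus_message *reply = NULL;
            const char *id = NULL;

            int r = sd_bus_open_user(&bus);          /* connect to the session bus */
            if (r < 0)
                return 1;

            /* Synchronous method call: destination, object path, interface, method. */
            r = sd_bus_call_method(bus,
                                   "org.freedesktop.DBus",
                                   "/org/freedesktop/DBus",
                                   "org.freedesktop.DBus",
                                   "GetId",
                                   &error, &reply, "");
            if (r >= 0 && sd_bus_message_read(reply, "s", &id) >= 0)
                printf("bus id: %s\n", id);

            sd_bus_error_free(&error);
            sd_bus_message_unref(reply);
            sd_bus_unref(bus);
            return r < 0 ? 1 : 0;
        }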

    It may also be worth reading this post: D-Bus: Performance improvement practices. Profiling is the key to understanding what exactly is happening and to fixing it. perf sched may also help to track issues related to context switches or even the I/O scheduler. If this is not enough and kdbus is the culprit, then you can profile the kernel directly with ftrace. A 1 ms overhead should be quite easy to track. You can find an example of such an analysis here.

    I've seen mentions in passing that D-Bus is the most performant way to send messages between processes on Linux, and that other alternatives perform worse.

    It depends on your needs, but if you just want to send a few bytes from one process to another without any high-level features (nor any standardized interface), then sending data through shared memory between the two processes is certainly much faster (far lower latency). The thing is, you need to synchronize the two processes so as to avoid race conditions. This synchronization can be quite expensive (but definitely far less than 1 ms as long as the communicating threads of the two processes are scheduled). A minimal sketch of this approach is given below.
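
    As an illustration (not a drop-in replacement for D-Bus), here is a minimal C sketch of that idea using an anonymous shared mapping and a process-shared POSIX semaphore for synchronization (compile with -pthread on older glibc versions); the 16-byte payload and the names are purely illustrative:

        #include <semaphore.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <sys/wait.h>
        #include <unistd.h>

        struct channel {
            sem_t ready;      /* signaled by the writer once the data is valid */
            char  data[16];   /* the small payload itself */
        };

        int main(void) {
            /* Anonymous shared mapping inherited across fork(); shm_open() would
             * be used instead to share memory between unrelated processes. */
            struct channel *ch = mmap(NULL, sizeof(*ch), PROT_READ | PROT_WRITE,
                                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (ch == MAP_FAILED)
                return 1;

            sem_init(&ch->ready, 1 /* shared between processes */, 0);

            if (fork() == 0) {                 /* child: the reader */
                sem_wait(&ch->ready);          /* block until the data is published */
                printf("child read: %s\n", ch->data);
                return 0;
            }

            strcpy(ch->data, "hello");         /* parent: the writer */
            sem_post(&ch->ready);              /* wake up the reader */
            wait(NULL);
            sem_destroy(&ch->ready);
            munmap(ch, sizeof(*ch));
            return 0;
        }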


    * typically due to security patches (e.g. for Spectre and Meltdown) intended to mitigate vulnerabilities of some CPUs, for example when two different processes run on the same core of a Skylake CPU with SMT enabled.