Tags: assembly, x86, intel, cpu-architecture, message-passing

Why didn't x86 implement direct core-to-core messaging assembly/cpu instructions?


Over years of development, CPUs gained many cores, distributed blocks of cores on multiple chiplets, NUMA systems, etc., but a piece of data still has to pass not only through the L1 cache (and only that if both threads are SMT siblings on the same core) but also through some atomic/mutex synchronization procedure that isn't accelerated by hardware.

I wonder why Intel or IBM didn't come up with something like:

movcor 1 MX 5    ; sends 5 to messaging register MX of core 1
pipe   1 1 1     ; pushes data=1 to pipe=1 of core=1; core 1 has to pop it
bcast  1         ; broadcasts 1 to all cores' pipe-0

to make it much faster than some other methods? GPUs support block-wise fast synchronization points, like barrier() or __syncthreads(). GPUs also support parallel atomic update acceleration for local arrays.

When CPUs gain 256 cores, won't this feature enable serious scaling for various algorithms that are bottlenecked on core-to-core bandwidth (and/or latency)?


Solution

  • CPUs evolved for a very different programming model than GPUs, to run multiple separate threads, potentially of different processes, so you'd also need software and OS infrastructure to let threads know which other core (if any) some other thread was running on. Or they'd have to pin each thread to a specific core. But even then it would need some way to virtualize the architectural message-passing register, the same way context switches virtualize the standard registers for multi-tasking on each core.

    So there's an extra hurdle before anything like this could even be usable at all under a normal OS, where a single process doesn't take full ownership of the physical cores. The OS is still potentially scheduling other threads of other processes onto cores, and running interrupt handlers, unlike a GPU where cores don't have anything else to do and are all built to work together on a massively parallel problem.

    Intel did introduce user-space interrupts in Sapphire Rapids, including user IPIs (inter-processor interrupts), but those don't involve a receive queue that would have to get saved/restored on context switch. The OS still has to manage some state (like the User Interrupt Target Table), but it's not as problematic for context switches, I think. It's solving a different problem than what you're suggesting, since it's an interrupt, not a message queue.

    Notifying another thread when to look in shared memory for data is the hard part of the problem that needed solving, more so than getting the data between cores. Shared memory is still fine for the data itself (especially with new instructions like cldemote, which lets the writer ask for a recently-stored cache line to be written back to the shared L3 where other cores can read it more efficiently). See the section below about UIPIs.
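
    As a rough sketch of that pattern (my illustration, not code from any particular library): a producer writes a payload, uses the cldemote hint to push the freshly-written line toward L3, then publishes a flag. _cldemote() is the real GCC/Clang intrinsic (needs -mcldemote and a CPU that implements the hint; elsewhere the instruction is just a NOP); the variable and function names are made up.

        // Producer/consumer handoff through shared memory, with cldemote as a
        // hint to push the written line out toward shared L3 before publishing.
        #include <atomic>
        #include <immintrin.h>   // _cldemote, _mm_pause (GCC/Clang: -mcldemote)

        alignas(64) static int payload;                       // data being handed off
        alignas(64) static std::atomic<bool> ready{false};    // publish flag, own line

        void producer(int value) {
            payload = value;                                  // plain store, lands in our L1/L2
            _cldemote(&payload);                              // hint: demote this line toward L3
            ready.store(true, std::memory_order_release);     // publish
        }

        int consumer() {
            while (!ready.load(std::memory_order_acquire))
                _mm_pause();                                  // spin politely until published
            return payload;                                   // hopefully already in L3, not stuck in the producer's L1
        }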


    A task that wants something like this is usually best done on a GPU anyway, not on a few separate deeply pipelined out-of-order exec CPU cores that are doing speculative execution; GPU cores are comparatively simple in-order pipelines.

    You couldn't actually push a result to another core until it retires on the core executing it, because you don't want to have to roll back the other core as well if you discover a mis-speculation (such as a branch mispredict) earlier in the path of execution leading to it. That could conceivably still be lower-latency than bouncing a cache line between cores through shared memory, but it's a pretty narrow class of applications that could use it.

    However, high-performance computing is a known use-case for modern CPUs, so if it was really a game-changer it would be worth considering as a design choice, perhaps.

    Some things aren't easy or possible to do efficiently given the architecture of existing CPUs. Small units of work with fine-grained cooperation between threads is a problem. Your idea might help if it was practical to implement, but there are major challenges.


    Inter-processor Interrupts (IPI), including user-IPI

    For OS use, there is the IPI mechanism. But that triggers an interrupt, so it doesn't line up data to be read when the receiving side's out-of-order exec pipeline gets to it; it's very different from the mechanism you're suggesting, and for different use-cases.

    It's quite low-performance, except as a way to avoid having the other side poll. It can also wake a core up from a power-saving sleep state: if more threads are now ready to run, the core should wake up and call schedule() to figure out which one to run.

    Any core can send an IPI to any other, if it's running in kernel mode.

    New in Sapphire Rapids, there is hardware support for the OS letting a user-space process handle some interrupts fully in user-space.

    https://lwn.net/Articles/869140/ is an LKML post explaining it and how Linux could support it. Apparently it's about 10x faster than "eventfd" for ping-ponging a tiny message between two user-space threads a million times. Or 15x faster than doing the same with a POSIX signal handler.
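
    For reference, the eventfd baseline in that comparison looks roughly like this (my sketch of the general shape of such a ping-pong, not the actual benchmark code): each round trip is two blocking system calls per thread, which is the kernel overhead user IPIs avoid.

        // Two threads wake each other through a pair of eventfds; every round
        // trip goes through the kernel. Linux-only.
        #include <sys/eventfd.h>
        #include <unistd.h>
        #include <cstdint>
        #include <thread>

        int main() {
            int ping = eventfd(0, 0);
            int pong = eventfd(0, 0);
            constexpr int kIters = 1000000;

            std::thread peer([&] {
                uint64_t v;
                for (int i = 0; i < kIters; ++i) {
                    read(ping, &v, sizeof v);    // block until pinged
                    write(pong, &v, sizeof v);   // reply
                }
            });

            uint64_t v = 1;
            for (int i = 0; i < kIters; ++i) {
                write(ping, &v, sizeof v);       // ping
                read(pong, &v, sizeof v);        // wait for the reply
            }
            peer.join();
            close(ping);
            close(pong);
        }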

    Kernel-managed architectural data structures

    • UPID: User Posted Interrupt Descriptor - Holds receiver interrupt vector information and notification state (like an ongoing notification, suppressed notifications).
    • UITT: User Interrupt Target Table - Stores UPID pointer and vector information for interrupt routing on the sender side. Referenced by the senduipi instruction.

    The interrupt state of each task is referenced via MSRs which are saved and restored by the kernel during context switch.

    Instructions

    • senduipi <index> - send a user IPI to a target task based on the UITT index (see the sketch after this list).
    • clui - Mask user interrupts by clearing UIF (User Interrupt Flag).
    • stui - Unmask user interrupts by setting UIF.
    • testui - Test current value of UIF.
    • uiret - return from a user interrupt handler.
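
    Roughly how those pieces fit together, as a sketch rather than working code: the _senduipi/_stui intrinsics and the uintr interrupt attribute are real GCC/Clang features behind -muintr, but the UITT/UPID setup has to come from the OS. The names ui_handler, receiver_setup, sender, and the uintr_register_handler / uintr_register_sender syscalls are placeholders; those syscalls are from the proposed Linux patch series discussed in the LWN article, not mainline.

        #include <x86gprintrin.h>   // _senduipi, _stui, _clui (compile with -muintr)

        // Receiver: a user-interrupt handler; the CPU jumps here directly at CPL=3.
        void __attribute__((target("uintr"), interrupt))
        ui_handler(struct __uintr_frame *frame, unsigned long long vector) {
            // No data arrives here, just the vector number; any payload still
            // has to be picked up from shared memory.
        }

        void receiver_setup() {
            // uintr_register_handler() is the syscall proposed in the LKML series
            // (not in mainline Linux); it fills in this thread's UPID for the HW.
            // syscall(__NR_uintr_register_handler, ui_handler, 0);
            _stui();                // set UIF: allow user interrupts to be delivered
        }

        void sender(unsigned long long uitt_index) {
            // uitt_index would come from uintr_register_sender() in the same
            // proposed API; it selects a UITT entry pointing at the target's UPID.
            _senduipi(uitt_index);  // fire the user IPI, no kernel involvement
        }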

    So it does have new state to be saved/restored on context switch. I suspect it's smaller than the queues you might have been picturing, though. And critically, it doesn't need a receive queue anywhere for threads that aren't running, because the state involves tables in memory and there's no data payload, just a pending-or-not flag, I guess. So in the not-running-receiver case, it can just set a bit in a table that senders need to be able to see anyway, for the HW to know where to direct the UIPI. That's unlike your idea, which would need to find the kernel's saved register state (or some other space) and append to a variable-size(?) buffer.

    If the receiver is running (CPL=3), then the user interrupt is delivered directly without a kernel transition. If the receiver isn't running the interrupt is delivered when the receiver gets context switched back. If the receiver is blocked in the kernel, the user interrupt is delivered to the kernel which then unblocks the intended receiver to deliver the interrupt.

    So data still has to go through memory, this is just about telling another thread when to look so it doesn't have to poll / spin.
    I think UIPIs are useful for different use-cases than your proposed message queue.

    So you still wouldn't generally use this when the receiving thread knows that specific data is coming soon, except maybe to let it work on something independent instead of spin-waiting or sleeping.

    It's also usable if the thread doesn't specifically expect data soon, unlike your queue idea. So you can have it working on something low priority, but then start work that's part of the critical path as soon as more of that is ready.

    It's still an interrupt, so significant overhead is still expected, just a lot less than going through the kernel for a signal handler or similar. I think that, like any interrupt, it would need to drain the out-of-order back-end. Or maybe not even that bad: since it doesn't have to change privilege level, maybe it could be treated like a branch mispredict. Discarding instructions in the ROB that way would give lower interrupt latency, but worse throughput than just re-steering the front-end to the interrupt-handler address and letting the instructions already in flight finish.


    Incorrect scalability assumptions in your question

    scaling for various algorithms that are bottlenecked on core-to-core bandwidth (and/or latency)?

    Mesh interconnects (like Intel has used since Skylake-Xeon) allow pretty large aggregate bandwidth between cores. There isn't a single shared bus they all have to compete for. Even the ring bus Intel used before Skylake-Xeon, and still uses in client chips, is pipelined and has pretty decent aggregate bandwidth.

    Data can be moving between every pair of cores at the same time. (I mean, 128 pairs of cores can each have data in flight in both directions. With some memory-level parallelism, a pipelined interconnect can have multiple cache lines in flight requested by each core.)

    This involves the shared L3 cache but typically not DRAM, even across sockets. (On AMD, where clusters of cores are tightly connected in a CCX core complex, that applies between cores within the same die.)

    See also some Anandtech articles with good benchmarks of inter-core latency (cache-line ping-pong).
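
    The kind of measurement they do is a cache-line ping-pong like this (a minimal sketch, not Anandtech's actual harness; thread pinning is omitted): two threads bounce one atomic flag back and forth, and the round-trip time approximates core-to-core latency.

        #include <atomic>
        #include <chrono>
        #include <cstdio>
        #include <thread>

        alignas(64) std::atomic<int> flag{0};   // one cache line bounced between two cores
        constexpr int kIters = 1000000;

        int main() {
            std::thread responder([] {
                for (int i = 0; i < kIters; ++i) {
                    while (flag.load(std::memory_order_acquire) != 1) {}  // wait for ping
                    flag.store(0, std::memory_order_release);             // pong
                }
            });

            auto t0 = std::chrono::steady_clock::now();
            for (int i = 0; i < kIters; ++i) {
                flag.store(1, std::memory_order_release);                 // ping
                while (flag.load(std::memory_order_acquire) != 0) {}      // wait for pong
            }
            auto t1 = std::chrono::steady_clock::now();
            responder.join();

            std::printf("round trip: %.1f ns\n",
                        std::chrono::duration<double, std::nano>(t1 - t0).count() / kIters);
            // Pin each thread to a specific core (e.g. with pthread_setaffinity_np)
            // to measure a particular core pair; results vary a lot with topology.
        }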


    GPUs also support parallel atomic update acceleration for local arrays.

    I think I've heard of some CPUs (at least in theory, maybe in practice) allowing fast memory_order_relaxed atomics by putting a simple ALU into the shared cache, so a core can send an atomic-increment request to shared L3 where it happens on the data there, instead of having to get exclusive ownership of the line temporarily. (With the old value being returned read-only to supply the return value of a fetch_add or exchange.)

    This wouldn't be easily ordered wrt. other loads or stores on other locations done by the core that sent the atomic operation request out to be done by the cache.
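
    In plain C++ that's just a relaxed fetch_add; whether the hardware executes it near (pulling the line into this core) or far (in the cache that owns it) is invisible at this level. A made-up example of the kind of access pattern that could benefit:

        #include <atomic>
        #include <cstddef>

        std::atomic<unsigned long> histogram[256];   // shared counters, potentially contended

        void count_bytes(const unsigned char* data, std::size_t len) {
            for (std::size_t i = 0; i < len; ++i) {
                // Only atomicity is required, no ordering with this core's other
                // accesses, so an increment like this could in principle be shipped
                // to the shared cache instead of bouncing the line between cores.
                histogram[data[i]].fetch_add(1, std::memory_order_relaxed);
            }
        }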

    Anandtech's Graviton2 review shows a slide that mentions "Dynamic near vs. far atomic execution". That might be a form of this idea, perhaps allowing an atomic to execute remotely (maybe at the core owning the cache line?) if the memory-ordering requirements of that instruction are weak enough to allow it. That's just a guess; it's a separate question which I won't dig into further here.

    (ARMv8.1 "Large System Extensions" provide x86-style single-instruction atomic RMWs, in addition to the traditional LL/SC, which requires a retry loop in case of spurious failure but can synthesize any atomic RMW operation, like you can with a CAS retry loop.)
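
    That CAS-retry-loop pattern, for reference (a generic sketch; "atomic max" isn't a built-in RMW on x86, so it has to be synthesized exactly this way, and on LL/SC machines the compiler turns it into a load-linked/store-conditional loop):

        #include <atomic>

        // Synthesize an arbitrary RMW (here: atomic maximum) from compare-exchange.
        int atomic_fetch_max(std::atomic<int>& target, int desired) {
            int old = target.load(std::memory_order_relaxed);
            while (old < desired &&
                   !target.compare_exchange_weak(old, desired,
                                                 std::memory_order_acq_rel,
                                                 std::memory_order_relaxed)) {
                // compare_exchange_weak reloaded 'old' on failure; keep retrying
                // until we install 'desired' or see a value that's already >= it.
            }
            return old;
        }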