I am curious about the vastly different performance characteristics of running x86-64 binaries on the Apple M1 platform under Rosetta 2 vs. under emulation, for example what Docker Desktop currently does with QEMU.
I understand why emulation is so slow, and an explanation of why Rosetta 2 is so fast is detailed in this Twitter thread: https://twitter.com/ErrataRob/status/1331735383193903104
The gist of that explanation is that under usual circumstances, ARM and x86 have opposite (and incompatible) memory addressing schemes, which requires significant emulation overhead, but the M1 chip addresses this with a hardware optimization that lets it access memory using either scheme. Effectively, when Rosetta 2-translated instructions are running, a flag is set to tell the processor to use the x86-style scheme.
Assuming this explanation is reasonable (and if anyone has better-sourced reporting than the above Twitter thread, I would appreciate it in the comments for inclusion), is it technically plausible that this optimization could be leveraged for full hardware emulation, for example running x86-64 Linux Docker containers, or running a full x86-64 Windows desktop virtual machine à la VMware Fusion/VirtualBox? Or does the separate operating-system layer in those scenarios preclude being able to leverage the memory-ordering optimization?
Separately, is this processor mode (flags or instructions) documented and published for third-party use, or is it private to Apple?
Apple released support for Rosetta 2 in Linux VMs via the Virtualization framework, announced at WWDC in June 2022.
Also, Docker Desktop for Mac 4.16.0 (released 2023-01-12) added a beta option to run x86-64 containers using this feature.
Not memory addressing: memory ordering, i.e. the rules for lock-free atomics used for inter-thread synchronization. In x86 asm, every load is an acquire load and every store is a release store. (Real x86 CPUs still do speculative early loads, so performance doesn't suffer under normal conditions when a single thread is operating on memory that other threads aren't writing.)
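To make that concrete, here is a minimal C++ sketch (my own illustration, not from the linked thread) of the message-passing pattern those rules govern. Under x86's total store order (TSO), the acquire/release orderings cost nothing extra; on weakly ordered AArch64 they require special instructions:

```cpp
// Message passing with acquire/release atomics.
// x86-64:  both atomic operations compile to plain MOV, because every
//          x86 store is already a release and every load an acquire.
// AArch64: the compiler must emit STLR (store-release) and LDAR
//          (load-acquire) to get the same guarantee.
#include <atomic>

int payload;                      // plain, non-atomic data
std::atomic<bool> ready{false};   // synchronization flag

void producer() {
    payload = 42;                                  // write the data...
    ready.store(true, std::memory_order_release);  // ...then publish it
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // wait for the flag
    return payload;  // guaranteed to see 42, not a stale value
}
```

An emulator translating x86 code therefore has to treat every guest load and store as acquire/release; without hardware TSO support, that means emitting barriers or LDAR/STLR everywhere, which is a large part of the slowdown.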
The M1 has hardware support for a TSO mode like that, as well as the default weakly ordered mode so native AArch64 code runs most efficiently. See:
And yes, https://github.com/saagarjha/TSOEnabler is open-source software to toggle that support. But it's a kernel extension, and code signing makes it tricky to get macOS to allow it: you have to sign it or disable signature verification. From its README:
Supposedly, you should be able to use this if you build and sign the kernel extension (disabling SIP if you aren't using a KEXT signing certificate) and drag it into /Library/Extensions. A dialog should come up to prompt you to enable the extension in the Security & Privacy preferences pane, you allow it from there and restart, and it will be installed. (If you're not seeing it, the permissions on the extension might be wrong: try a chown -R root:wheel.) In practice this can go wrong in many ways, and I have had luck by "resetting everything" and trying to install after doing the following:
[...] see the link for the list of steps
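If you do get the extension loaded, a classic message-passing litmus test can show whether TSO is actually in effect for your threads. Below is a minimal, self-contained C++ sketch (my own illustration, not from the TSOEnabler README): with relaxed atomics, the "flag visible but data stale" outcome is forbidden under x86-style TSO but allowed under AArch64's default weak ordering:

```cpp
// Message-passing litmus test: counts hardware reorderings.
// Under TSO (e.g. with TSOEnabler active for these threads) the count
// should stay 0; under the default weak ordering on Apple Silicon,
// nonzero counts can show up.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> data{0}, flag{0};

int main() {
    long reorders = 0;
    for (int i = 0; i < 200000; ++i) {
        data.store(0, std::memory_order_relaxed);
        flag.store(0, std::memory_order_relaxed);

        std::thread writer([] {
            data.store(1, std::memory_order_relaxed);
            // Compiler-only fence: blocks compile-time reordering but
            // emits no barrier instruction, so any hardware reordering
            // stays observable.
            std::atomic_signal_fence(std::memory_order_seq_cst);
            flag.store(1, std::memory_order_relaxed);
        });
        std::thread reader([&] {
            int f = flag.load(std::memory_order_relaxed);
            std::atomic_signal_fence(std::memory_order_seq_cst);
            int d = data.load(std::memory_order_relaxed);
            if (f == 1 && d == 0)
                ++reorders;  // impossible under TSO, legal when weakly ordered
        });
        writer.join();
        reader.join();
    }
    std::printf("reorderings observed: %ld\n", reorders);
}
```

Spawning a thread per iteration makes the race window small, so real litmus harnesses pin two long-running threads and spin instead; this naive version may need many iterations before it observes a reordering even on weakly ordered hardware.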
So yes, it's plausible that QEMU's x86 emulation could use the same hardware support that Rosetta 2's x86 emulation does. They're both x86 emulators.
And as you say, the main issue is Apple providing public APIs for enabling the hardware mode so people don't have to install this kernel extension manually; I'm sure most people wouldn't want to do that. I don't know much about the software situation; I was more interested in the CPU-architecture details.