[SOLVED] Why can't QEMU get even close to Rosetta 2's performance when translating x86 to M1?

Why can't QEMU get even close to Rosetta 2's performance when translating x86 to M1?

Apparently, QEMU is the only piece of open source code that can emulate an x86 operating system on the new Apple silicon (M1, M2, etc.).

Apple built Rosetta 2, which, in theory, does the exact same thing that QEMU would be doing in these scenarios. It translates x86 (Intel) instructions into the instruction set supported by the new Apple silicon processors.

Rosetta 2 does it with remarkable performance, and some x86 applications even run with better performance than on native x86 hardware. QEMU, on the other hand, doesn't get even close when running x86 Linux on Apple silicon.

How can Rosetta have such superior performance? Are there any "secrets" that only Apple knows about their architecture that were never shared with the QEMU project? Any forbidden APIs that QEMU is not allowed to access?

Solution

Rosetta and QEMU are both emulators. However, they tackle the problem in vastly different ways.

QEMU

In order to emulate a a Linux system, QEMU must also emulate storage devices, console output devices, ethernet devices, keyboards, and the entire CPU. With this framework, it emulates every instruction doing everything with Just in Time translation. From the Linux kernel down to your /bin/ls command.

There are generally few limitations to QEMU's Intel emulation. You can run most any Intel Operating System and associated applications.

Rosetta 2

Apple's emulation, on the other hand, happens before the application launches. The entire binary is translated from x86 to Apple Silicon and launched. Once translated, the application is in effect a native arm64 binary making native macOS system calls.

Apple's documentation explains it thus:

If an executable contains only Intel instructions, macOS automatically launches Rosetta and begins the translation process. When translation finishes, the system launches the translated executable in place of the original. However, the translation process takes time, so users might perceive that translated apps launch or run more slowly at times.

Additionally, Apple Silicon has an option to run code translated from Intel using identical memory-ordering as Intel:

4/ So Apple simply cheated. They added Intel's memory-ordering to their CPU. When running translated x86 code, they switch the mode of the CPU to conform to Intel's memory ordering. — Robᵉʳᵗ Graham, provocateur (@ErrataRob) November 25, 2020

Rosetta 2 has a number of significant limitations. For example you can't use Intel Kernel extensions, Virtual Machine apps that virtualize x86_64 computer platforms (Parallels for example), or AVX/AVX2/AVX512 vector instructions.