Tags: x86, sse, avx, mmx

Are the different MMX, SSE, and AVX versions complementary or supersets of each other?


I'm thinking I should familiarize myself with the x86 SIMD extensions. But before I even began, I ran into trouble: I can't find a good overview of which of them are still relevant.

The x86 architecture has accumulated a lot of math/multimedia extensions over decades:

Are the newer ones supersets of the older ones and vice versa? Or are they complementary?

Are some of them deprecated? Which of these are still relevant? I've heard references to "legacy SSE".

Are some of them mutually exclusive? I.e. do they share the same hardware parts?

Which should I use together to maximize hardware utilization on modern Intel / AMD CPUs? For the sake of argument, let's assume I can find appropriate uses for the instructions... heating my house with the CPU if nothing else.


Solution

  • I recently updated the tag wikis for SSE, AVX, and x86 (and SSE2, AVX2). They cover a lot of this. tl;dr summary: AVX rolls up all the previous SSE versions and provides 3-operand versions of those instructions, plus 256b versions of most FP (AVX) and integer (AVX2) insns.

    For summaries of the various SSE versions, see Wikipedia, or knm241's more-detailed answer.

    That doesn't really make SSE obsolete. Rather, think of AVX as a new and better version of the same old SSE instructions. They're still in the ref manual under their non-AVX names (PSHUFB, not VPSHUFB, for example.) You can mix AVX and SSE code, as long as you use VZEROUPPER when needed to avoid the performance problem from mixing VEX with non-VEX insns (on Intel, and some more recent AMD). So there is some annoyance in dealing with cases where you have to call into libraries that might run non-VEX SSE instructions, or where your code uses SSE FP math but also has some AVX code to be run only if the CPU supports it.

    If CPU-compatibility were a non-issue, the legacy-SSE versions of vector instructions would be essentially obsolete, like MMX is now. AVX/AVX2 is at least slightly better in every way, if you count the VEX-encoded 128b version as AVX, not SSE, except for code-size and some microarchitectural details1. Sometimes you'd still use 128b vectors because your data only comes in chunks that big, but more often you'd work with 256b registers to do the same op on twice as much data at once. (If legacy-SSE didn't exist or was never used, vzeroupper / vzeroall might not be needed, unless it makes context switches cheaper to clean the vector state.)

    SSE/AVX/x87-FP/integer instructions all use the same execution ports. You can't get more done in parallel by mixing them. (except on Haswell and later, where one of the 4 ALU ports can only handle non-vector insns, like GP reg ops and branches).

    Actually, AMD also uses separate schedulers and execution ports for scalar-integer vs. vector stuff; see for example a Zen 2 block diagram. With instruction throughput of 5 per clock and up to 6 uops per clock, there is some front-end bandwidth left in loops that max out all four of Zen's SIMD execution units. Usually that's load/store instructions and loop overhead (GP integer), like on Intel where there are 3 ports with SIMD ALUs/FPUs, and some more ports that can handle scalar/vector loads and stores, and scalar integer stuff.


    Footnote 1: Corner cases where AVX has some disadvantages: