Tags: assembly, x86, sse

What is the point of MOVAPS in x86 if it performs the same as MOVUPS on modern CPUs?


I was writing a memset function for an embedded system, and I found the fastest approach was using movups. Since my memory was already aligned, I switched to movaps expecting faster and smaller code. After many tries, both gave exactly the same timings, and both instructions have the same encoding size. So what is the point of movaps if it requires aligned memory yet delivers only the same performance?

From Intel Developer Manual:

MOVAPS: Move four aligned packed single precision floating-point values between XMM registers or between an XMM register and memory.

MOVUPS: Move four unaligned packed single precision floating-point values between XMM registers or between an XMM register and memory.

And similarly, what is the purpose of MOVAPD, MOVUPD, MOVDQA, and MOVDQU? In practice they all do the same thing.

Question also posted on reddit.


Solution

  • These days, the only remaining use is as an assertion that your data is actually aligned. Cache-line splits, and especially page splits, aren't as fast, so you might want to verify alignment instead of silently letting the hardware make slower accesses than you were expecting.
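    Instead of relying on movaps to fault at runtime, the same check can be made explicit in C so a debug build catches misalignment early. A sketch, with an illustrative helper name:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>

    /* True if p is 16-byte aligned, the requirement movaps enforces. */
    static int is_aligned16(const void *p)
    {
        return ((uintptr_t)p & 15u) == 0;
    }

    int main(void)
    {
        /* aligned_alloc (C11): size must be a multiple of the alignment */
        float *buf = aligned_alloc(16, 64 * sizeof(float));
        assert(buf && is_aligned16(buf));

        /* buf + 1 is only 4-byte aligned: a movaps load from it would fault */
        assert(!is_aligned16(buf + 1));

        puts("alignment checks passed");
        free(buf);
        return 0;
    }
    ```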

    Historically (before Nehalem and Bulldozer-family, around 2008), movups was always slower even for aligned addresses, e.g. decoding to more uops and with worse throughput even if you weren't bottlenecked on the front-end. (AMD K10 had efficient movups loads but not stores.) See https://agner.org/optimize/ for instruction tables (of uops and throughput) that include those old CPUs.

    SSE1 was new in Pentium III, launched 1999, and designed on paper in the years before that, when transistor budgets were much smaller. Handling 32-bit unaligned loads / stores with full performance (as long as they weren't split across a cache-line) was something they could manage, but wasn't something they wanted to spend transistors on for 128-bit loads where it would take 4x the width of muxers.
    In the early days of SIMD, it wasn't as widely used either so the benefit (for users) wouldn't have been as large for the CPU. With x86-64, SSE2 was a baseline feature that all software could assume, but most 32-bit code couldn't.

    In the early days, compiler auto-vectorization was much more limited, so only a few programs or libraries had manually-vectorized code. With SIMD mostly only getting used in programs designed around using it, an alignment requirement was usually not a big limitation.


    Note that with C intrinsics, _mm_load_ps (as opposed to _mm_loadu_ps) lets the compiler fold the load into a memory source operand for another instruction, like addps xmm0, [rdi]. An alignment requirement on memory operands was the default for SSE, unlike for AVX (vaddps xmm0, xmm0, [rdi] allows unaligned). With AVX, communicating alignment guarantees to the compiler matters only for tuning choices (notably GCC's default -mtune=generic before GCC10 or 11 or so favoured Sandybridge long after it was obsolete). But intrinsics are a separate question from the movaps asm instruction having a purpose.
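    To illustrate the folding point: with the aligned load intrinsic, an SSE compiler is allowed to emit the load as a memory operand of addps rather than a separate movaps. The function name below is illustrative, not from the original post:

    ```c
    #include <immintrin.h>
    #include <stdio.h>

    /* _mm_load_ps requires 16-byte alignment, like movaps; under SSE the
       compiler may fold it into `addps xmm0, [rdi]`. _mm_loadu_ps would
       instead force a separate movups instruction before the addps. */
    static __m128 add_from_mem(__m128 acc, const float *p)
    {
        return _mm_add_ps(acc, _mm_load_ps(p));
    }

    int main(void)
    {
        _Alignas(16) float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        _Alignas(16) float out[4];
        __m128 acc = _mm_set1_ps(10.0f);

        _mm_store_ps(out, add_from_mem(acc, data));
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }
    ```

    Compiling with -O2 -msse and inspecting the asm shows whether the load was folded.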

    Some modern compilers, MSVC and classic ICC, never use movaps except for copying registers. They use movups for _mm_load_ps / _mm_store_ps unless they can fold the load into a memory source operand for another instruction, so that folded case is the only one where you actually get alignment checking. And they started doing this long ago, before Core 2 systems were thoroughly obsolete; the binaries they produce are slower on those old systems.