Tags: performance, x86, cpu-architecture, avx2, amd-processor

Intel vs AMD gather AVX performance


I noticed that some AVX2 instructions on Zen 3 have a ridiculously high μop cost compared to their Intel counterparts. According to the μops tables:

VPGATHERDD                Skylake    Zen3
Latency (clocks)          [0;22]     [0;28]
Reciprocal TP (measured)  5.00       8.00
μops (measured)           5          39

These numbers look like something that could affect gather performance. This question is similar to some older scalar-vs-gather questions, but those were mostly about Intel and didn't discuss the Excavator/Zen μop cost of gathers at all. Maybe that's because AMD CPUs weren't popular at the time, but today it's more relevant. The only explanation for such a big difference that I found is a random comment claiming that gathers are microcoded on AMD CPUs. I couldn't find any further explanation in either Agner Fog's guides or AMD's optimization manuals.

I wrote a small benchmark* and ran it on Zen 3, Skylake and Broadwell processors to see how scalar loads compare to gathers (a rough sketch of the two kernels follows the table):

                         Broadwell   Skylake   Zen3
scalar (baseline)        1x          1x        1x
gather (speedup)         1.5-2.1x    3.1-6x    1x
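
For reference, the two kernels being compared look roughly like this (a minimal sketch of the idea, not the exact benchmark code; the function names are mine, and n is assumed to be a multiple of 8):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    // Gather version: one vpgatherdd per 8 elements (compile with -O2 -mavx2).
    int32_t sum_gather(const int32_t *table, const int32_t *idx, size_t n) {
        __m256i acc = _mm256_setzero_si256();
        for (size_t i = 0; i < n; i += 8) {
            __m256i vidx = _mm256_loadu_si256((const __m256i *)&idx[i]);
            acc = _mm256_add_epi32(acc, _mm256_i32gather_epi32(table, vidx, 4));
        }
        int32_t lane[8], sum = 0;          // horizontal sum of the accumulator
        _mm256_storeu_si256((__m256i *)lane, acc);
        for (int k = 0; k < 8; k++) sum += lane[k];
        return sum;
    }

    // Scalar version: ordinary indexed loads do the same work.
    int32_t sum_scalar(const int32_t *table, const int32_t *idx, size_t n) {
        int32_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += table[idx[i]];
        return sum;
    }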

The difference in reciprocal throughput alone should account for about a 1.6x (8/5) advantage in Intel's favor. How much of the gap can be attributed to the difference in μop count?

Can a large μop cost hurt out-of-order execution when gathers are mixed with real code, or is that unlikely since Zen processors have a big μop cache? Is there a better benchmark for this?

*The initial benchmark was wrong; the link and the numbers in the table have been fixed.


Solution

  • The biggest effect of being many uops is on how well it can overlap with surrounding code (e.g. in a loop) that isn't the same instruction.

    If a gather is nearly the only thing in a loop, you're mostly going to bottleneck on the throughput of the gather instruction itself, whichever part of the pipeline it is that limits gathers to that throughput.

    But if the loop does a lot of other stuff, e.g. computing gather indices and/or using the gather result, or fully independent especially scalar integer work, it might run close to a front-end bottleneck (6 uops per clock cycle issue/rename on Zen 3), or a bottleneck on back-end ALU ports. (AMD has separate integer and FP back-end pipelines; Intel shares ports, although there are a few extra execution ports that only have scalar integer ALUs.) In that case, it would be the uops cost of the gather that contributes to the bottleneck.
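
    For example, a hypothetical loop like the following (my illustration, not code from the question) has a scalar multiply-add chain that's fully independent of the gather, yet its uops still compete with the gather's ~39 uops (on Zen 3) for front-end issue slots:

        #include <immintrin.h>
        #include <stddef.h>
        #include <stdint.h>

        // Gather mixed with independent scalar integer work: separate data
        // dependencies, but both streams of uops share the front end
        // (6/clock issue/rename on Zen 3) and compete for back-end ports.
        int64_t gather_plus_scalar(const int32_t *table, const int32_t *idx,
                                   size_t n, int64_t seed) {
            __m256i acc = _mm256_setzero_si256();
            int64_t s = seed;
            for (size_t i = 0; i < n; i += 8) {   // n assumed a multiple of 8
                __m256i vidx = _mm256_loadu_si256((const __m256i *)&idx[i]);
                acc = _mm256_add_epi32(acc,
                                       _mm256_i32gather_epi32(table, vidx, 4));
                s = s * 3 + (int64_t)i;           // independent scalar ALU chain
            }
            return s + _mm256_extract_epi32(acc, 0); // keep both results live
        }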

    Other than branch misses and cache misses, the 3 dimensions of performance are front-end uops, back-end ports it competes for, and latency as part of a critical path. Notice that none of these are the same as just running the same instruction back-to-back, the number you get from measuring "throughput" of a single instruction. That's useful to identify any other special bottlenecks for those uops.

    Some uops may occupy a port for multiple cycles, e.g. some of Intel's gather loads are fewer uops than the total number of elements, so they might stop other loads from dispatching at some point, creating more back-end port pressure than you might expect from the number of uops for each port. FP divide/sqrt is like that, too. But since AMD's gathers are so many uops, I'd hope that they're all fully pipelined.

    AMD's AVX1/2 masked stores are also a ton of uops; IDK how exactly they emulate that in microcode if they don't have efficient dedicated hardware for it, but it's not great for performance. Maybe by breaking it into multiple conditional scalar stores.
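
    If that guess is right, the emulation might look conceptually like this C sketch (pure speculation about what the microcode does, not anything documented):

        #include <immintrin.h>
        #include <stdint.h>

        // One way microcode *might* emulate vpmaskmovd [mem], ymm_mask, ymm_src:
        // test the sign bit of each mask element, and store that dword only if
        // it's set. (Speculation; real microcode must also avoid faulting on
        // masked-out elements that touch an unmapped page.)
        void maskmov_emulated(int32_t *dst, __m256i mask, __m256i src) {
            int32_t m[8], v[8];
            _mm256_storeu_si256((__m256i *)m, mask);
            _mm256_storeu_si256((__m256i *)v, src);
            for (int i = 0; i < 8; i++)
                if (m[i] < 0)              // MSB of mask element = store enable
                    dst[i] = v[i];
        }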

    Bizarrely, Zen 4's AVX-512 masked stores like vmovdqu32 (m256, k, ymm) are efficient: a single uop with 1/clock throughput (despite being able to run on either store port, according to https://uops.info/, so 2/clock might have been possible; Intel has had 2/clock masked-store throughput, the same as regular stores, since Ice Lake). If the microcode for vpmaskmovd would just compare into a mask and use the same HW support as vmovdqu32, it would be far more efficient. I assume that's what Intel does, given the uop counts for vmaskmovps.
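
    In intrinsics, that compare-into-a-mask approach to the AVX2 semantics would look something like this sketch (assuming AVX-512F + AVX-512VL; vpmaskmovd keys on each mask element's sign bit, which a signed compare against zero reproduces):

        #include <immintrin.h>
        #include <stdint.h>

        // AVX2 vpmaskmovd store semantics re-expressed with AVX-512VL:
        // turn the vector mask's sign bits into a k-mask, then use the
        // single-uop masked store (vmovdqu32) that Zen 4 handles efficiently.
        void maskmov_avx512(int32_t *dst, __m256i mask, __m256i src) {
            __mmask8 k = _mm256_cmplt_epi32_mask(mask, _mm256_setzero_si256());
            _mm256_mask_storeu_epi32(dst, k, src);
        }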



    "…is that unlikely since Zen processors have a big μop cache?"

    It's not about caching the uops, it's about getting all those uops through the pipeline every time the instruction runs.

    An instruction that decodes to more than 2(?) uops on AMD, or more than 4 on Intel, is considered "microcoded", and the uop cache just stores a pointer to the microcode sequencer, not all the uops themselves. This mechanism makes it possible to support instructions like rep movsb, which run a variable number of uops depending on register values. On Intel at least, a microcoded instruction takes a whole line of the uop cache to itself. (See https://agner.org/optimize/ - especially his microarchitecture guide.)