Tags: performance, x86, cpu, cpu-architecture, memory-bandwidth

Load/stores per cycle for recent CPU architecture generations


Inspired by this answer to FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2:

What are the numbers of loads, or of loads plus stores, that one can issue per cycle on a core of Sandy/Ivy Bridge, Haswell/Broadwell, and Sky/Kaby Lake? The numbers for AMD Bulldozer, Jaguar and Zen are also interesting.

PS - I know that might not be a sustainable rate because of cache/memory bandwidth limits; I'm only asking about issue rates.


Solution

  • Based on information from Agner Fog's instruction tables and https://uops.info/ testing (both cited in the notes below):

    Architecture           Loads  Stores  256-bit counts as two?
    Sandy/Ivy Bridge         1      1     Yes
    Sandy/Ivy Bridge         2      0     Yes
    Haswell/Broadwell        2      1     No
    Skylake/Kaby Lake        2      1     No
    Ice/Tiger Lake           2      2     No (2 stores/clock sustained only within the same cache line)
    Alder Lake / Sapphire    3      1     No, but throughput may be "somewhat less"
    Alder Lake / Sapphire    2      2     No, but throughput may be "somewhat less"
    Bulldozer                1      1     Yes
    Bulldozer                2      0     Yes
    Jaguar                   1      1     Yes
    Zen 1                    1      1     Yes
    Zen 1                    2      0     Yes
    Zen 2                    2      1     No
    Zen 3/4                  3      2     (integer registers)
    Zen 3/4                  2      1     No, but fewer pipes handle vectors of any width
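    As a quick worked example of what these per-cycle counts mean for peak L1d bandwidth (the clock speed here is an illustrative assumption): Skylake's 2 loads + 1 store per cycle at 256 bits each is 2 × 32 B + 32 B = 96 B/cycle, or roughly 384 GB/s of combined L1d traffic per core at 4 GHz. That is an issue-rate ceiling, not a sustainable figure, as the question's PS already anticipates.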

    Sandy/Ivy Bridge: per cycle, 2 loads, or 1 load and 1 store. 256-bit loads and stores count double, but only with respect to the load or store itself: the op still has only one address, so the AGU becomes available again the next cycle. By mixing in some 256-bit operations you can still get the equivalent of 2x 128-bit loads and 1x 128-bit store per cycle.
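    A minimal sketch of that payoff (assuming AVX, 32-byte-aligned arrays, and n a multiple of 8; the function name is illustrative): a triad kernel needing two loads and one store per 32-byte vector, which Sandy/Ivy Bridge can sustain at roughly one iteration per 2 cycles — the same bytes/cycle as 2x 128-bit loads + 1x 128-bit store every cycle.

        #include <immintrin.h>
        #include <stddef.h>

        /* Per 32-byte vector: 2 loads + 1 store. On Sandy/Ivy Bridge each
         * 256-bit memory op keeps its port busy for 2 cycles but frees its
         * AGU after the first cycle, so the store-address uop can still
         * issue. Best case is therefore about 1 iteration per 2 cycles. */
        void triad(float *a, const float *b, const float *c, size_t n) {
            for (size_t i = 0; i < n; i += 8) {
                __m256 vb = _mm256_load_ps(b + i);  /* load port, 2 cycles */
                __m256 vc = _mm256_load_ps(c + i);  /* the other load port */
                _mm256_store_ps(a + i, _mm256_add_ps(vb, vc)); /* store port */
            }
        }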

    Haswell/Broadwell: 2 loads and a store per cycle, and 256-bit loads/stores don't count double. Port 7 (the store AGU) can only handle simple address calculations (base + constant, no index); complex cases go to p2/p3 and compete with loads, and simple cases may compete anyway but at least don't have to.
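    A hedged sketch of the port-7 caveat (the function names, and the claim that a compiler emits exactly these addressing modes, are assumptions — check the generated asm): an indexed store address like [dst + i*4] can't use port 7, while a pointer-bumping loop whose store address is plain base+displacement can.

        #include <immintrin.h>
        #include <stddef.h>

        /* Indexed addressing ([base + index*4]) for the store: the
         * store-address uop can't run on port 7, so it competes with the
         * loads on p2/p3. */
        void copy_indexed(float *dst, const float *src, size_t n) {
            for (size_t i = 0; i < n; i += 8)
                _mm256_store_ps(dst + i, _mm256_load_ps(src + i));
        }

        /* Bumping both pointers keeps the store address in simple
         * base(+const) form, which port 7 can handle, leaving p2/p3 free
         * for the loads. */
        void copy_bump(float *dst, const float *src, size_t n) {
            for (const float *end = src + n; src < end; src += 8, dst += 8)
                _mm256_store_ps(dst, _mm256_load_ps(src));
        }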

    Sky/Kaby Lake: the same as Haswell/Broadwell.

    Ice/Tiger Lake: 2 loads and 2 stores per clock, with fully separate execution units for each (store-address uops don't run on load ports.) 2/clock stores can only be sustained if stores are to the same cache line. i.e. 1/clock write to L1d cache, but a write can commit two store-buffer entries if they're to the same line. For memory-ordering reasons, the two store-buffer entries have to be the two oldest, so alternating stores to two separate arrays couldn't benefit from this unless you can unroll.
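    To illustrate the commit restriction (a sketch; 64-byte cache lines and 32-byte alignment are assumed): two back-to-back stores into the same line can commit together, while alternating stores to two arrays can't, because the two oldest store-buffer entries then sit in different lines.

        #include <immintrin.h>
        #include <stddef.h>

        /* Consecutive 32-byte halves of one 64-byte line: the two oldest
         * store-buffer entries can commit to L1d together, so 2 stores per
         * clock is sustainable. */
        void fill_one(float *a, __m256 v, size_t n) {
            for (size_t i = 0; i < n; i += 16) {
                _mm256_store_ps(a + i,     v);  /* same cache line...      */
                _mm256_store_ps(a + i + 8, v);  /* ...as the store above   */
            }
        }

        /* Alternating between two arrays: adjacent store-buffer entries are
         * in different lines, so commit stays at 1/clock unless you can
         * unroll and reorder the stores. */
        void fill_two(float *a, float *b, __m256 v, size_t n) {
            for (size_t i = 0; i < n; i += 8) {
                _mm256_store_ps(a + i, v);
                _mm256_store_ps(b + i, v);
            }
        }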

    Alder Lake / Sapphire Rapids: 3 loads and 1 store, or 2 loads and 2 stores. Agner Fog reports those throughputs for sizes up to 128-bit, but "somewhat less" for 256 and 512-bit loads/stores. Commit to L1d may be limited like Ice Lake for more than 1 store per clock.

    Bulldozer: 2 loads, or 1 load and 1 store. 256-bit loads and stores count double.

    Jaguar: 1 load or 1 store per cycle, and 256-bit loads and stores count double. By far the worst in this list, but then it's also the only low-power µarch here.

    Zen 1 (first-gen Ryzen): 2 loads, or 1 load and 1 store. 256-bit loads and stores count double.

    Zen 2 (most Ryzen 3xxx and 4xxx, though some 3xxx models are only Zen+, not Zen 2): 3 AGUs (2 load/store, 1 store-only). Up to two 256-bit loads and one 256-bit store per cycle.

    Zen 3: load throughput increased from 2 to 3 per cycle for scalar integer GPRs, and store throughput from 1 to 2 for scalar integer GPRs. (WikiChip incorrectly states the store limit as "if not 256-bit", but https://uops.info/ testing confirms only 1/clock vector stores even with 128-bit vmovaps [mem], xmm.)

    Zen 4: no change from Zen 3 in these counts. AVX-512 512-bit ops are single-uop, but occupy the load and store-data units for 2 cycles each, like how Sandy/Ivy Bridge handled 256-bit load/store. (Same for 512-bit ALU uops: a single uop, unlike how Zen 1 split 256-bit ops.)
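    A small sketch of what that 2-cycle occupancy implies, under the assumptions that AVX-512 is available, the arrays are 64-byte aligned, and n is a multiple of 16: a plain 512-bit copy loop on Zen 4 would be store-limited to about one 64-byte iteration per 2 cycles, even though every memory op is a single uop.

        #include <immintrin.h>
        #include <stddef.h>

        /* Each 512-bit load/store is one uop on Zen 4 but occupies its
         * unit for 2 cycles: loads can still reach ~1x 512-bit per clock
         * across the two vector load units, while 1/clock vector stores
         * become ~1 per 2 clocks at 512 bits -- the bound for this loop. */
        void copy512(float *dst, const float *src, size_t n) {
            for (size_t i = 0; i < n; i += 16)
                _mm512_store_ps(dst + i, _mm512_load_ps(src + i));
        }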