# FLOPs per cycle for Sandy Bridge and Haswell and others SSE2 / AVX / AVX2 / AVX-512


I'm confused about how many FLOPs per cycle per core can be done with Sandy Bridge and Haswell. As I understand it, with SSE it should be 4 FLOPs per cycle per core, and with AVX/AVX2 it should be 8 FLOPs per cycle per core.

This seems to be confirmed here, How do I achieve the theoretical maximum of 4 FLOPs per cycle?, and here, Sandy-Bridge CPU specification.

However, the link below seems to indicate that Sandy Bridge can do 16 FLOPs per cycle per core and Haswell 32 FLOPs per cycle per core: http://www.extremetech.com/computing/136219-intels-haswell-is-an-unprecedented-threat-to-nvidia-amd.

Can someone explain this to me?

Edit: I understand now why I was confused. I thought the term FLOP only referred to single-precision floating point (SP). I see now that the tests at How do I achieve the theoretical maximum of 4 FLOPs per cycle? are actually on double-precision floating point (DP), so they achieve 4 DP FLOPs/cycle for SSE and 8 DP FLOPs/cycle for AVX. It would be interesting to redo these tests in SP.

Solution

Here are theoretical max FLOP counts (**per core**) for a number of recent processor microarchitectures, and an explanation of how to achieve them.

In general, to calculate this, look up the throughput of the FMA instruction(s), e.g. on https://agner.org/optimize/ or any other microbenchmark result, and multiply:

`(FMAs per clock) * (vector elements / instruction) * 2 (FLOPs / FMA)`
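As a quick sanity check, the formula above can be written out in a few lines of code. The function name and the example numbers (Haswell's 2 FMA units on 256-bit vectors) are taken from the tables below:

```python
def peak_flops_per_cycle(fmas_per_clock, elements_per_vector):
    """Theoretical peak FLOPs/cycle/core: each FMA counts as 2 FLOPs
    (one multiply and one add)."""
    return fmas_per_clock * elements_per_vector * 2

# Haswell: 2 FMA units, 256-bit vectors -> 4 doubles or 8 floats per vector
print(peak_flops_per_cycle(2, 4))   # DP: 16
print(peak_flops_per_cycle(2, 8))   # SP: 32
```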

Note that achieving this in real code requires very careful tuning (like loop unrolling), near-zero cache misses, and no bottlenecks on anything *else*. Modern CPUs have such high FMA throughput that there isn't much room for other instructions to store the results, or to feed them with input. For example, 2 SIMD loads per clock is also the limit for most x86 CPUs, so a dot product will bottleneck on 2 loads per 1 FMA. A carefully tuned dense matrix multiply can come close to achieving these numbers, though.
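The dot-product bottleneck mentioned above can be sketched as a little arithmetic, assuming (as stated) 2 loads/clock and 2 FMAs/clock, with 2 loads feeding each FMA:

```python
def dot_product_peak_fraction(loads_per_clock=2, fmas_per_clock=2,
                              loads_per_fma=2):
    """Fraction of peak FLOPs a dot product can sustain: whichever of the
    load ports or FMA ports saturates first limits the FMA rate."""
    fma_rate_limited_by_loads = loads_per_clock / loads_per_fma
    sustainable_fmas = min(fmas_per_clock, fma_rate_limited_by_loads)
    return sustainable_fmas / fmas_per_clock

print(dot_product_peak_fraction())  # 0.5 -> at most half of peak FLOPs
```

This is why a dot product tops out at roughly half the FMA throughput on a 2-load/2-FMA machine, while a blocked matrix multiply (which reuses each loaded vector across many FMAs) can get close to peak.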

If your workload includes any ADD/SUB or MUL that can't be contracted into FMAs, the theoretical max numbers aren't an appropriate goal for your workload. Haswell/Broadwell have 2-per-clock SIMD FP multiply (on the FMA units), but only 1-per-clock SIMD FP add (on a separate vector FP add unit with lower latency). Skylake dropped the separate SIMD FP adder, running add/mul/FMA identically at 4-cycle latency and 2-per-clock throughput, for any vector width.

Note that Celeron/Pentium versions of recent microarchitectures don't support AVX or FMA instructions, only SSE4.2.

Intel Core 2 and Nehalem (SSE/SSE2):

- 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
- 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication

Intel Sandy Bridge/Ivy Bridge (AVX1):

- 8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
- 16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication

Intel Haswell/Broadwell/Skylake/Kaby Lake/Coffee/... (AVX+FMA3):

- 16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
- 32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
- (Using 256-bit vector instructions can reduce max turbo clock speed on some CPUs.)

Intel Skylake-X/Skylake-EP/Cascade Lake/etc (**AVX512F**) with **1 FMA unit**: some Xeon Bronze/Silver

- 16 DP FLOPs/cycle: one 8-wide FMA (fused multiply-add) instruction
- 32 SP FLOPs/cycle: one 16-wide FMA (fused multiply-add) instruction
- Same computation throughput as with narrower 256-bit instructions, but speedups can still be possible with AVX512 for wider loads/stores, a few vector operations that don't run on the FMA units like bitwise operations, and wider shuffles.
- (Having 512-bit vector instructions in flight shuts down the vector ALU on port 1. It also **reduces the max turbo clock speed**, so "cycles" isn't a constant in your performance calculations.)
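Because FLOPs/cycle is identical to the 256-bit case on these 1-FMA-unit parts, the AVX-512 turbo penalty can make 512-bit vectors a net loss for pure FMA work. A sketch, using hypothetical turbo frequencies purely for illustration (real values vary by SKU and load):

```python
def gflops(flops_per_cycle, ghz):
    """Peak GFLOP/s per core at a given clock speed."""
    return flops_per_cycle * ghz

# Hypothetical clocks for a 1-FMA-unit part (illustrative only):
avx2_dp   = gflops(16, 3.5)  # 2x 4-wide DP FMA/clock at the AVX2 turbo clock
avx512_dp = gflops(16, 3.0)  # 1x 8-wide DP FMA/clock at the lower AVX-512 turbo clock
# avx512_dp < avx2_dp: same FLOPs/cycle, fewer cycles per second
```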

Intel Skylake-X/Skylake-EP/Cascade Lake/etc (**AVX512F**) with **2 FMA units**: Xeon Gold/Platinum, and i7/i9 high-end desktop (HEDT) chips.

- 32 DP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
- 64 SP FLOPs/cycle: two 16-wide FMA (fused multiply-add) instructions
- (Having 512-bit vector instructions in flight shuts down the vector ALU on port 1. It also reduces the max turbo clock speed, although the penalty is much smaller on Ice Lake and especially on newer CPUs.)

Intel Cooper Lake (successor to Cascade Lake) introduced bfloat16 (Brain Float), a 16-bit float format for neural-network workloads, with support only for a SIMD dot product (accumulating into an f32 sum) and conversion of f32 to bf16 (AVX512_BF16). The existing F16C extension (with AVX2) only supports load/store with conversion to/from float32. https://uops.info/ reports that the BF16 instructions are multi-uop on Alder Lake (and presumably Sapphire Rapids), but single-uop on Zen 4. Ice Lake lacks BF16; it's found in Sapphire Rapids and later.

Intel chips before Sapphire Rapids only have actual computation directly on standard float16 in the iGPU. With AVX512_FP16 (Sapphire Rapids), math ops on fp16 are native operations, without having to convert to f32 and back. https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 . Unlike bf16 support, the full set of add/sub/mul/fma/div/sqrt/compare/min/max/etc ops is available for fp16, with the same per-vector throughput, doubling FLOPs.
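Since AVX512_FP16 keeps the same per-vector instruction throughput while fitting twice as many elements per 512-bit vector, the peak scales directly with element count. A sketch for a 2-FMA-unit part, following the formula from the top of the answer:

```python
# Elements per 512-bit vector, by type
elements_per_512 = {"fp64": 8, "fp32": 16, "fp16": 32}
fma_units = 2  # e.g. Xeon Gold/Platinum-class parts with 2 FMA units

# peak FLOPs/cycle = FMAs/clock * elements/vector * 2 FLOPs/FMA
peak = {t: fma_units * n * 2 for t, n in elements_per_512.items()}
print(peak)  # {'fp64': 32, 'fp32': 64, 'fp16': 128}
```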

AMD K10:

- 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
- 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication

AMD Bulldozer/Piledriver/Steamroller/Excavator, per module (two cores):

- 8 DP FLOPs/cycle: 4-wide FMA on 128-bit execution units
- 16 SP FLOPs/cycle: 8-wide FMA

AMD Ryzen (Zen 1)

- 8 DP FLOPs/cycle: 2-wide or 4-wide FMA on 128-bit execution units
- 16 SP FLOPs/cycle: 4-wide or 8-wide FMA

AMD Zen 2 and later: 2 FMA/MUL units and 2 ADD units on separate ports

- 24 DP FLOPs/cycle: 4-wide FMA + 4-wide ADD on 256-bit execution units
- 48 SP FLOPs/cycle: 8-wide FMA + 8-wide ADD
- With only FMAs, as in a matmul: 16 DP / 32 SP FLOPs/cycle using 256-bit instructions (or 512-bit instructions on Zen 4, which are single-uop but double-pumped through 256-bit execution units, so peak FLOPs/cycle is unchanged)

Intel Atom (Bonnell/45nm, Saltwell/32nm, Silvermont/22nm):

- 1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
- 6 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication every other cycle

Intel Gracemont (Alder Lake E-core):

- 8 DP FLOPs/cycle: 2-wide or 4-wide FMA on 128-bit execution units
- 16 SP FLOPs/cycle: 4-wide or 8-wide FMA

AMD Bobcat:

- 1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
- 4 SP FLOPs/cycle: 4-wide SSE addition every other cycle + 4-wide SSE multiplication every other cycle

AMD Jaguar:

- 3 DP FLOPs/cycle: 4-wide AVX addition every other cycle + 4-wide AVX multiplication every fourth cycle
- 8 SP FLOPs/cycle: 8-wide AVX addition every other cycle + 8-wide AVX multiplication every other cycle

ARM Cortex-A9:

- 1.5 DP FLOPs/cycle: scalar addition + scalar multiplication every other cycle
- 4 SP FLOPs/cycle: 4-wide NEON addition every other cycle + 4-wide NEON multiplication every other cycle

ARM Cortex-A15:

- 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
- 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add

Qualcomm Krait:

- 2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
- 8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add

IBM PowerPC A2 (Blue Gene/Q), per core:

- 8 DP FLOPs/cycle: 4-wide QPX FMA every cycle
- SP elements are extended to DP and processed on the same units

IBM PowerPC A2 (Blue Gene/Q), per thread:

- 4 DP FLOPs/cycle: 4-wide QPX FMA every other cycle
- SP elements are extended to DP and processed on the same units

Intel Xeon Phi (Knights Corner), per core:

- 16 DP FLOPs/cycle: 8-wide FMA every cycle
- 32 SP FLOPs/cycle: 16-wide FMA every cycle

Intel Xeon Phi (Knights Corner), per thread:

- 8 DP FLOPs/cycle: 8-wide FMA every other cycle
- 16 SP FLOPs/cycle: 16-wide FMA every other cycle

Intel Xeon Phi (Knights Landing), per core:

- 32 DP FLOPs/cycle: two 8-wide FMA every cycle
- 64 SP FLOPs/cycle: two 16-wide FMA every cycle

The reason there are both per-thread and per-core numbers for IBM Blue Gene/Q and Intel Xeon Phi (Knights Corner) is that these cores have a higher instruction issue rate when running more than one thread per core.
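The per-core numbers above are what you multiply out to get a chip's headline peak. As a sketch, using the Blue Gene/Q compute chip (16 user cores at 1.6 GHz, with the 8 DP FLOPs/cycle/core figure from the table above; the function name is made up for illustration):

```python
def machine_peak_gflops(cores, flops_per_cycle_per_core, ghz):
    """Peak DP GFLOP/s for a whole chip: cores * FLOPs/cycle * clock (GHz)."""
    return cores * flops_per_cycle_per_core * ghz

# Blue Gene/Q compute chip: 16 user cores, 8 DP FLOPs/cycle/core, 1.6 GHz
print(machine_peak_gflops(16, 8, 1.6))  # 204.8
```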
