cpuintelcpu-architectureavxflops

Theoretical maximum performance (FLOPS ) of Intel Xeon E5-2640 v4 CPU, using only addition?


I am confused about the theoretical maximum performance of the Intel Xeon E5-2640 v4 CPU (Boardwell-based). In this post, >800GFLOPS; in this post, about 200GFLOPS; in this post, 3.69GFLOPS per core, 147.70GFLOPS per computer. So what is the theoretical maximum performance of Intel Xeon E5-2640 v4 CPU?

Some specifications:

I tried to compute the theoretical maximum FLOPS. Based on my understanding, it should be (Max turbo frequency) * (IPC) * (#SIMD), which is 3.4 * 2 * 8 = 54.4GFLOPS, is it right?

Should it be multiplied by 2 (due to the pipeline technique which makes addition and multiplication can be done in parallel)? What if additions and multiplications do not appear at the same time? (eg. if the workload only contains additions, is *2 appropriate?)

Besides, the above computation should be the maximum FLOPS per core, right?


Solution

  • 3.4 GHz is the max single-core turbo (and also 2-core), so note that this isn't the per-core GFLOPS, it's the single-core GFLOPS.

    The max all-cores turbo is 2.6 GHz on that CPU, and probably won't sustain that for long with all cores maxing out their SIMD FP execution units. That's the most power-intensive thing x86 CPUs can do. So it will likely drop back to 2.4 GHz if you actually keep all cores busy.


    And yes you're missing a factor of two because FMA counts as two FP operations, and that's what you need to do to achieve the hardware's theoretical max FLOPS. FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2 . (Your Broadwell is the same as Haswell for max-throughput purposes.)

    If you're only using addition then only have one FLOP per SIMD element per instruction, and also only 1/clock FP instruction throughput on a v4 (Broadwell) or earlier.

    Haswell / Broadwell have two fully-pipelined SIMD FMA units (on ports 0 and 1), and one fully-pipelined SIMD FP-add unit (on port 1) with lower latency than FMA.

    The FP-add unit is on the same execution port as one of the FMA units, so it can start 2 FP uops per clock, up to one of which can be pure addition. (Unless you do addition x+y as fma(x, 1.0, y), trading higher latency for more throughput.)


    IPC (instruction per cycle) = 2;

    Nope, that's the number of FP math instructions per cycle, max, not total instructions per clock. The pipeline's narrowest point is 4 uops wide, so there's room for a bit of loop overhead and a store instruction every cycle as well as two SIMD FP operations.

    But yes, 2 FP operations started per clock, if they're not both addition.

    Should it be multiplied by 2 (due to the pipeline technique which makes addition and multiplication can be done in parallel)?

    You're already multiplying by IPC=2 for parallel additions and multiplications.

    If you mean FMA (Fused Multiply-Add), then no, that's literally doing them both as part of a single operation, not in parallel as a "pipeline technique". That's why it's called "fused".

    FMA has the same latency as multiply in many CPUs, not multiply and then addition. (Although on Broadwell, FMA latency = 5 cycles, vmulpd latency = 3 cycles, vaddpd latency = 3 cycles. All are fully pipelined, with a throughput discussed in the rest of this answer, since theoretical max throughput requires arranging your calculations to not bottleneck on the latency of addition or multiplication. e.g. using multiple accumulators for a dot product or other reduction.) Anyway, point being, a hardware FMA execution unit is not terribly more complex than an FP multiplier or adder, and you shouldn't think of it as two separate operations.

    If you write a*b + c in the source, a compiler can contract that into an FMA, instead of rounding the a*b to a temporary result before addition, depending on compiler options (and defaults) to allow that or not.

    Instruction Set Extensions: AVX2, so #SIMD = 256/64 = 8;

    256/64 = 4, not 8. In a 32-byte (256-bit) SIMD vector, you can fit 4 double-precision elements.


    Per core per clock, Haswell/Broadwell can begin up to: