I am trying to measure the number of floating-point operations (FLOPs) performed in a C++ program. I am using a Broadwell-based CPU and no GPU. I tried the following command, in which I included all the FP-related events I found:
perf stat -e fp_arith_inst_retired.128b_packed_double,fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.256b_packed_double,fp_arith_inst_retired.256b_packed_single,fp_arith_inst_retired.double,fp_arith_inst_retired.packed,fp_arith_inst_retired.scalar,fp_arith_inst_retired.scalar_double,fp_arith_inst_retired.scalar_single,fp_arith_inst_retired.single,inst_retired.x87 ./test_exe
I got something as follows:
Performance counter stats for './test_exe':
0 fp_arith_inst_retired.128b_packed_double (36.36%)
0 fp_arith_inst_retired.128b_packed_single (36.36%)
0 fp_arith_inst_retired.256b_packed_double (36.37%)
0 fp_arith_inst_retired.256b_packed_single (36.37%)
4,520,439,602 fp_arith_inst_retired.double (36.37%)
0 fp_arith_inst_retired.packed (36.36%)
4,501,385,966 fp_arith_inst_retired.scalar (36.36%)
4,493,140,957 fp_arith_inst_retired.scalar_double (36.37%)
0 fp_arith_inst_retired.scalar_single (36.36%)
0 fp_arith_inst_retired.single (36.36%)
82,309,806 inst_retired.x87 (36.36%)
65.861043789 seconds time elapsed
65.692904000 seconds user
0.164997000 seconds sys
Questions:
1. What is the relationship between fp_arith_inst_retired.double, fp_arith_inst_retired.scalar and fp_arith_inst_retired.scalar_double? These counters are related to SSE/AVX computations, right?
2. What do the percentages like (36.37%) mean in the perf results?
Thanks.
The normal way for C++ compilers to do FP math on x86-64 is with scalar versions of SSE instructions, e.g. addsd xmm0, [rdi] (https://www.felixcloutier.com/x86/addsd). Only legacy 32-bit builds default to using the x87 FPU for scalar math.
If your compiler didn't manage to auto-vectorize anything (e.g. you didn't use g++ -O3 -march=native), and the only math you do is with double, not float, then all the math operations will be done with scalar-double instructions.

Each such instruction will be counted by the fp_arith_inst_retired.double, .scalar, and .scalar_double events. They overlap; they're basically sub-filters of the same event. (FMA operations count as two, even though they're still only one instruction, so these are FLOP counts, not uop or instruction counts.)
So you have 4,493,140,957 FLOPs over 65.86 seconds: 4493140957 / 65.86 / 1e9 ~= 0.0682 GFLOP/s, i.e. very low.
If you had had any counts for 128b_packed_double, you'd multiply those by 2. As noted in the perf list description: "each count represents 2 computation operations, one for each element", because a 128-bit vector holds two 64-bit double elements. So each count for this event is 2 FLOPs. Similarly for the others, follow the scale factors described in the perf list output, e.g. times 8 for 256b_packed_single.
So you do need to separate the SIMD events by type and width, but you can just look at .scalar without separating single and double.
See also FLOP measurement, one of the duplicates of FLOPS in Python using a Haswell CPU (Intel Core Processor (Haswell, no TSX)), which was linked on your previous question.
(36.37%) is how much of the total time that event was programmed on a HW counter. You used more events than there are counters, so perf multiplexed them for you, swapping every so often and extrapolating based on that statistical sampling to estimate the total over the run-time. See Perf tool stat output: multiplex and scaling of "cycles".
You could get exact counts for the non-zero non-redundant events by leaving out the ones that are zero for a given build.