I am trying to measure the number of floating-point operations (FLOPs) performed in a C++ program. I am using a Broadwell-based CPU and no GPU. I tried the following command, in which I included all the FP-related events I found:
perf stat -e fp_arith_inst_retired.128b_packed_double,fp_arith_inst_retired.128b_packed_single,fp_arith_inst_retired.256b_packed_double,fp_arith_inst_retired.256b_packed_single,fp_arith_inst_retired.double,fp_arith_inst_retired.packed,fp_arith_inst_retired.scalar,fp_arith_inst_retired.scalar_double,fp_arith_inst_retired.scalar_single,fp_arith_inst_retired.single,inst_retired.x87 ./test_exe
I got something as follows:
Performance counter stats for './test_exe':
0 fp_arith_inst_retired.128b_packed_double (36.36%)
0 fp_arith_inst_retired.128b_packed_single (36.36%)
0 fp_arith_inst_retired.256b_packed_double (36.37%)
0 fp_arith_inst_retired.256b_packed_single (36.37%)
4,520,439,602 fp_arith_inst_retired.double (36.37%)
0 fp_arith_inst_retired.packed (36.36%)
4,501,385,966 fp_arith_inst_retired.scalar (36.36%)
4,493,140,957 fp_arith_inst_retired.scalar_double (36.37%)
0 fp_arith_inst_retired.scalar_single (36.36%)
0 fp_arith_inst_retired.single (36.36%)
82,309,806 inst_retired.x87 (36.36%)
65.861043789 seconds time elapsed
65.692904000 seconds user
0.164997000 seconds sys
Questions:
1. What is the relationship between fp_arith_inst_retired.double, fp_arith_inst_retired.scalar and fp_arith_inst_retired.scalar_double? These counters are related to SSE/AVX computations, right?
2. What do the percentages like (36.37%) mean in the perf results?
Thanks.
The normal way for C++ compilers to do FP math on x86-64 is with scalar versions of SSE instructions, e.g. addsd xmm0, [rdi] (https://www.felixcloutier.com/x86/addsd). Only legacy 32-bit builds default to using the x87 FPU for scalar math.
If your compiler didn't manage to auto-vectorize anything (e.g. you didn't use g++ -O3 -march=native), and the only math you do is with double, not float, then all the math operations will be done with scalar-double instructions.

Each such instruction will be counted by the fp_arith_inst_retired.double, .scalar, and .scalar_double events. They overlap; they're basically sub-filters of the same event. (FMA operations count as two, even though they're still only one instruction, so these are FLOP counts, not uop or instruction counts.)
So you have 4,493,140,957 FLOPs over 65.86 seconds: 4493140957 / 65.86 / 1e9 ~= 0.0682 GFLOP/s, i.e. very low.
If you had had any counts for 128b_packed_double, you'd multiply those by 2. As noted in the perf list description: "each count represents 2 computation operations, one for each element", because a 128-bit vector holds two 64-bit double elements. So each count for this event is 2 FLOPs. Similarly for the others, follow the scale factors described in the perf list output, e.g. times 8 for 256b_packed_single.
So you do need to separate the SIMD events by type and width, but you can just look at .scalar without separating single and double.
See also FLOP measurement, one of the duplicates of FLOPS in Python using a Haswell CPU (Intel Core Processor (Haswell, no TSX)), which was linked on your previous question.
(36.37%) is how much of the total time that event was programmed on a HW counter. You used more events than there are counters, so perf multiplexed them for you, swapping every so often and extrapolating based on that statistical sampling to estimate the total over the run-time. See Perf tool stat output: multiplex and scaling of "cycles".
You could get exact counts for the non-zero non-redundant events by leaving out the ones that are zero for a given build.