math cpu-architecture hpc blas intel-mkl

arithmetic intensity of zgemv versus dgemv/sgemv?

The arithmetic intensity of sgemv (or dgemv) is derived in this set of exercises (https://florian.world/wp-content/uploads/FM-High-Performance-Computing-I-Assignment-1.pdf) to be: 0.5 / (1+c), where c is a constant.

I was wondering whether zgemv, the complex-counterpart of these operations, has the same arithmetic intensity as the purely real ones?

I think as follows:

A complex-complex multiplication of the type (a+b*i) * (c+d*i), where i^2=-1, is equal to (ac - cd) + i*(bc + ad), so there are 4 multiplications and two additions, for a total of 6 ops. In contrast, to multiply two purely real numbers, only one multiplication is needed.

To load a complex number, two double precision numbers have to be loaded, so this increases the memory traffic by 2 as well.

In total, the arithmetic intensity shall be roughly 3 times larger?

Many thanks!

Solution

gemv is a matrix-vector product, so each element of the result is a dot-product of the vector and a row or column of the matrix. That's a multiply and add, not just multiply.

On modern hardware, that's done with one FMA (Fused Multiply-Add) for about the same cost as one multiply, effectively getting the add part for free.

An optimized ZGEMV would also use some FMAs for the complex multiply and add, like 2 multiplies, 2 FMAs for just a complex multiply.

Or four FMAs if adding to existing accumulators of real and imaginary parts. (Creating two long dependency chains, so you'd need to unroll more to hide it, although you do already have instruction-level parallelism between the FMAs that add into the real vs. imaginary parts.)

So with FMA, it should be about 2x the computational intensity: 4x as many FMAs for 2x as many loads per "element".

This is assuming the complex numbers are stored with separate arrays of real and imaginary parts, so SIMD loads can get vectors of [r0, r1, r2, r3] and [i0, i1, i2, i3], keeping the problem pure vertical with no shuffling like you'd need with a [r0,i0, r1,i1, ...] interleaved layout (Array of Structs aka AoS). If you need shuffles, then computational intensity is higher, but it's not "useful" work, and it's potentially on different execution units than the ones that can run FMA.