parallel-processing, x86-64, matrix-multiplication, simd, blas

Multi-threaded fixed-size matrix-vector multiplication optimized for many-core CPUs with non-uniform caches


I would like to implement a parallel matrix-vector multiplication for a fixed-size matrix (~3500x3500 floats), optimized for my CPUs and cache layout (AMD Zen 2/4), that is repeatedly executed for changing input vectors (setup time is not critical; sustained performance is). The programming language is C++.

Can anyone point me at good (perhaps optimal) strategies for partitioning the matrix and the threads with respect to cache utilization and synchronization (the += reduction) overhead? For example, what block size is best, and how should several threads traverse the multiplication? I'd then try to apply the strategy to my particular CPUs.

I'm free to duplicate the matrix data across multiple CCXs for cache efficiency, and the matrix doesn't need to be contiguous in RAM either. I can choose any format and order that promises the best efficiency.
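To make those degrees of freedom concrete, here's a hypothetical sketch of the kind of layout I'm considering (all names are placeholders, not from any library): each thread owns a cache-line-aligned copy of a contiguous block of rows, so the data can live close to the CCX that uses it.

    // Hypothetical layout sketch: one aligned, thread-local row block per thread.
    #include <algorithm>
    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    constexpr int N        = 3456; // matrix dimension
    constexpr int kThreads = 64;   // one row block per thread

    struct ThreadBlock {
        int    firstRow = 0;
        int    numRows  = 0;
        float* data     = nullptr; // numRows * N floats, 64-byte aligned
    };

    std::vector<ThreadBlock> partitionRows(const float* A) {
        std::vector<ThreadBlock> blocks(kThreads);
        const int rowsPerThread = (N + kThreads - 1) / kThreads;
        for (int t = 0; t < kThreads; ++t) {
            ThreadBlock& b = blocks[t];
            b.firstRow = std::min(t * rowsPerThread, N);
            b.numRows  = std::min(rowsPerThread, N - b.firstRow);
            // N * sizeof(float) is a multiple of 64 here, as aligned_alloc requires.
            b.data = static_cast<float*>(
                std::aligned_alloc(64, sizeof(float) * b.numRows * N));
            // Ideally this copy is performed by the thread that will use the
            // block, so first-touch places the pages on the right NUMA node.
            for (int r = 0; r < b.numRows; ++r)
                std::copy_n(A + static_cast<std::size_t>(b.firstRow + r) * N,
                            N, b.data + static_cast<std::size_t>(r) * N);
        }
        return blocks;
    }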

Alternatively, I'd also appreciate it if anybody knows of such a library or is able to share code. No need to reinvent things :)

Thanks.


Solution

  • Here's a multi-threaded SIMD 32-bit MVM implementation, based on @Soonts' suggestions, that performs reasonably well on an AMD EPYC 7742.

    https://gist.github.com/eastwindow/022668c9b495d932dc49dd8c45c81a7d

    4096 x 4096 matrix: 30 us

    3456 x 3456 matrix: 23 us

    2048 x 2048 matrix: 7 us

    1632 x 1632 matrix: 5.5 us

    Synchronization overhead of 64 threads (no MVM): 2 us

    All timings are averages of 1,000,000 repetitions with no break between repetitions.
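    For context, the per-thread inner kernel has roughly this shape (a simplified sketch; multiplyBlock and its parameters are illustrative, the actual code is in the gist). Because the rows are partitioned per thread, each thread writes a disjoint slice of the output vector, so no cross-thread += reduction is needed, only a barrier at the end:

        #include <cstddef>

        // Sketch only: computes y[firstRow .. firstRow+numRows) = Ablock * x,
        // where Ablock holds this thread's rows contiguously. Compile with
        // OpenMP enabled so the pragma vectorizes the dot product.
        void multiplyBlock(const float* __restrict Ablock,
                           const float* __restrict x,
                           float* __restrict y,
                           int firstRow, int numRows, int n) {
            for (int r = 0; r < numRows; ++r) {
                float sum = 0.0f;
                #pragma omp simd reduction(+ : sum)
                for (int c = 0; c < n; ++c)
                    sum += Ablock[static_cast<std::size_t>(r) * n + c] * x[c];
                y[firstRow + r] = sum;
            }
        }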

    However, with a break between repetitions (e.g., setting wait = 200000; and removing the call to multiplyInner_omp_simd_a() in void stay_busy( int threadnr )), the MVMs take a lot longer. Adding _mm_pause() doesn't help, nor do the other things I tried to keep the AVX units busy.
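    For reference, the wait loop I experimented with looks roughly like this (simplified; the go flag is illustrative, not the gist's exact variable):

        #include <immintrin.h>
        #include <atomic>

        // Simplified spin-wait between repetitions; 'go' is an illustrative
        // flag set by the dispatching thread. _mm_pause() hints to the CPU
        // that this is a spin loop, but it did not prevent the slowdown
        // after a break.
        void waitForWork(const std::atomic<bool>& go) {
            while (!go.load(std::memory_order_acquire))
                _mm_pause();
        }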

    Is there another way to keep the MVM fast after a break other than doing the MVM continuously?

    (The system is well tuned for low latency and determinism. The cores on the socket are completely isolated, and all power-saving and frequency up-/down-scaling features of the CPU are disabled.)

    Another thing I find noteworthy is the synchronization with atomics. If each atomic sits on its own cache line, the synchronization takes the longest (4 us). Using 4 atomics per cache line gives the lowest overhead, 2 us, despite the false sharing. It seems the cache-coherence protocol resolves the contended lines faster than loading from 64 separate cache lines.
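    The two layouts I compared look like this (a sketch; the struct names are illustrative):

        #include <atomic>
        #include <cstdint>

        // Variant A: one flag per 64-byte cache line. No false sharing, but
        // the waiting thread has to pull 64 separate lines -> ~4 us measured.
        struct alignas(64) PaddedFlag {
            std::atomic<uint32_t> v{0};
        };
        PaddedFlag paddedFlags[64]; // sizeof(PaddedFlag) == 64

        // Variant B: four flags share each cache line. The four writers
        // falsely share a line, but the waiter only loads 16 lines
        // -> ~2 us measured.
        struct alignas(64) PackedFlags {
            std::atomic<uint32_t> v[4];
        };
        PackedFlags packedFlags[16];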