c++ sse simd cpu-registers register-allocation

When does data move around between SSE registers and the stack?


I'm not exactly sure what happens when I call _mm_load_ps. I know that it loads an array of 4 floats into a __m128, which I can use to do SIMD-accelerated arithmetic and then store back, but isn't this __m128 data type still on the stack? Obviously there aren't enough registers for an arbitrary number of vectors to be loaded. So are these 128 bits of data moved back and forth each time you use a SIMD instruction to do a computation? If so, then what is the point of _mm_load_ps?

Maybe I have it all wrong?


Solution

  • An Intel processor with SSE, AVX, or AVX-512 has from 8 to 32 SIMD registers (see below), depending on the instruction set and on whether the code is 32-bit or 64-bit. When you call _mm_load_ps, the values are loaded into a SIMD register. If all the registers are in use, some will have to be spilled onto the stack.

    This is exactly what happens when you have a lot of int or scalar float variables and the compiler can't keep all the currently "live" ones in registers. Load/store intrinsics mostly just exist to tell the compiler about alignment, and as an alternative to pointer-casting onto other C data types. They don't have to compile to actual loads or stores, and they aren't the only way for compilers to emit vector load or store instructions.
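    As a minimal sketch of the pattern the question describes, the hypothetical helper add4 below (the name and signature are not from the question) loads two aligned arrays, adds them, and stores the result. The __m128 variables are ordinary local values: an optimizing compiler will normally keep them in XMM registers and only spill them if it runs out, and it is free to fold a load into the add instruction's memory operand instead of emitting a separate load.

```cpp
#include <immintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_add_ps, _mm_store_ps
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration: out[i] = a[i] + b[i] for i in 0..3.
// Assumption: all three pointers are 16-byte aligned, as _mm_load_ps and
// _mm_store_ps require.
void add4(const float* a, const float* b, float* out) {
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);

    // These are plain local values. Where they live (register or stack)
    // is the compiler's decision, just as for int or float locals.
    __m128 va = _mm_load_ps(a);
    __m128 vb = _mm_load_ps(b);
    __m128 sum = _mm_add_ps(va, vb);

    _mm_store_ps(out, sum);  // write the 4 results back to memory
}
```

    Calling add4 on alignas(16) arrays {1, 2, 3, 4} and {10, 20, 30, 40} writes {11, 22, 33, 44} to the output array. Inspecting the generated assembly (for example with -O2 -S) is the reliable way to see which loads and stores the compiler actually emitted.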


    Processor with SSE

    8  128-bit registers labeled XMM0 - XMM7  //32-bit operating mode
    16 128-bit registers labeled XMM0 - XMM15 //64-bit operating mode
    

    Processor with AVX/AVX2

    8  256-bit registers labeled YMM0 - YMM7  //32-bit operating mode
    16 256-bit registers labeled YMM0 - YMM15 //64-bit operating mode
    

    Processor with AVX-512 (2015/2016 servers, Ice Lake laptop, ?? desktop)

    8  512-bit registers labeled ZMM0 - ZMM7  //32-bit operating mode
    32 512-bit registers labeled ZMM0 - ZMM31 //64-bit operating mode
    

    Wikipedia has a good summary of this in its AVX-512 article.

    (Of course, the compiler can only use x/y/zmm16..31 if you tell it it's allowed to use AVX-512 instructions. Having an AVX-512-capable CPU does you no good when running machine code compiled to work on CPUs with only AVX2.)
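    To illustrate that the usable register file depends on how the code was compiled and not on the CPU it runs on, here is a rough sketch that reports the register count and width from the tables above, based on predefined compiler macros. The function and struct names are hypothetical. It assumes the usual GCC/Clang/MSVC macros (__x86_64__, _M_X64, __i386__, _M_IX86, __AVX__, __AVX512F__), and for 32-bit builds it assumes SSE is enabled at all.

```cpp
// Hypothetical type for illustration: count and width of the SIMD
// registers the compiler may allocate for this translation unit.
struct SimdRegisterFile {
    int count;  // number of architectural vector registers
    int bits;   // width of the widest enabled register
};

// Evaluated from compile-time target macros, so the result reflects the
// build flags (e.g. -mavx2, -mavx512f), not the machine running the code.
constexpr SimdRegisterFile simd_register_file() {
#if defined(__x86_64__) || defined(_M_X64)
  #if defined(__AVX512F__)
    return {32, 512};  // ZMM0 - ZMM31
  #elif defined(__AVX__)
    return {16, 256};  // YMM0 - YMM15
  #else
    return {16, 128};  // XMM0 - XMM15
  #endif
#elif defined(__i386__) || defined(_M_IX86)
  #if defined(__AVX512F__)
    return {8, 512};   // ZMM0 - ZMM7
  #elif defined(__AVX__)
    return {8, 256};   // YMM0 - YMM7
  #else
    return {8, 128};   // XMM0 - XMM7 (assumes SSE is enabled)
  #endif
#else
    return {0, 0};     // not an x86 target; the tables above do not apply
#endif
}
```

    Building the same source with and without an AVX-512 option on a 64-bit target should change the reported count from 16 to 32, regardless of which CPU executes the binary.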