intrinsics sse4

Why does the pseudocode of _mm_insert_ps calculate %8?


In the Intel Intrinsics Guide, the pseudocode for the operation of _mm_insert_ps includes the following:

FOR j := 0 to 3
    i := j*32
    IF imm8[j%8]
        dst[i+31:i] := 0
    ELSE
        dst[i+31:i] := tmp2[i+31:i]
    FI
ENDFOR

The access into imm8 confuses me: IF imm8[j%8]. As j is within the range 0..3, the modulo 8 part doesn't seem to do anything. Does this maybe signal a conversion that I am not aware of? Or is % not "modulo" in this case?


Solution

  • It seems like a pointless modulo: for j in 0..3, j%8 is just j, so the check reads imm8[j].

    Intel's documentation for the corresponding asm instruction, insertps, doesn't use any % modulo operations in its pseudocode. It uses ZMASK ←imm8[3:0] and then effectively unrolls the part that the intrinsics guide writes as a loop, with checks like

    IF (ZMASK[2] = 1) THEN DEST[95:64]←00000000H
        ELSE DEST[95:64]←TMP2[95:64]
    

    This is just showing how the low 4 bits of the immediate perform zero-masking on the 4 dword elements of the final result, after the insert of an element from another vector, or a scalar in memory.

    (There's no intrinsic for inserting directly from memory; you'd need a movss load intrinsic such as _mm_load_ss and then hope the compiler folds that load into a memory operand for insertps. With a memory source, imm8[7:6] is ignored and the scalar dword from memory is the element that gets inserted (that's the ELSE COUNT_S←0 in the asm pseudocode). Everything else works the same, including the zero-masking you're asking about.)
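To make the role of each immediate field concrete, here is a minimal scalar sketch in C of the register-source form described above. It is not Intel's code and does not use the real intrinsic: the `vec4` type and the `insert_ps_model` function are hypothetical names introduced only for illustration, and the model works on float values rather than raw bit patterns.

```c
/* Hypothetical scalar model of insertps with a register source.
 * vec4 and insert_ps_model are illustration-only names, not an Intel API. */
typedef struct { float f[4]; } vec4;

static vec4 insert_ps_model(vec4 a, vec4 b, unsigned imm8)
{
    unsigned count_s = (imm8 >> 6) & 3u;  /* imm8[7:6]: which element of b to take */
    unsigned count_d = (imm8 >> 4) & 3u;  /* imm8[5:4]: which slot of a to overwrite */
    unsigned zmask   = imm8 & 0xFu;       /* imm8[3:0]: per-element zero mask */

    vec4 dst = a;                         /* start from the destination operand */
    dst.f[count_d] = b.f[count_s];        /* the insert step (TMP2 in the pseudocode) */

    for (unsigned j = 0; j < 4; j++) {
        /* j is always 0..3 here, so j % 8 == j; the modulo changes nothing. */
        if ((zmask >> (j % 8)) & 1u)
            dst.f[j] = 0.0f;              /* zero-masked element */
    }
    return dst;
}
```

With a = {1, 2, 3, 4}, b = {10, 20, 30, 40} and imm8 = 0x94 (binary 10 01 0100), this model takes element 2 of b, writes it to slot 1 of a, and then zeroes element 2. The loop deliberately keeps the `j % 8` from the intrinsics guide to show that dropping it would give the same result.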