In the Intel Intrinsics Guide, the pseudocode for the operation of _mm_insert_ps includes the following:
FOR j := 0 to 3
    i := j*32
    IF imm8[j%8]
        dst[i+31:i] := 0
    ELSE
        dst[i+31:i] := tmp2[i+31:i]
    FI
ENDFOR
The access into imm8 confuses me: IF imm8[j%8]. As j is within the range 0..3, the modulo 8 part doesn't seem to do anything. Does this maybe signal a conversion that I am not aware of? Or is % not "modulo" in this case?
Seems like a pointless modulo.
Intel's documentation for the corresponding asm instruction, insertps, doesn't use any % modulo operations in the pseudocode. It uses ZMASK ← imm8[3:0] and then basically unrolls the part of the pseudocode where the intrinsics guide uses a loop, with checks like
IF (ZMASK[2] = 1) THEN DEST[95:64]←00000000H
ELSE DEST[95:64]←TMP2[95:64]
This is just showing how the low 4 bits of the immediate perform zero-masking on the 4 dword elements of the final result, after the insert of an element from another vector, or a scalar in memory.
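To make the equivalence concrete, here is a minimal scalar sketch in C of the two ways of writing the zero-masking step. The function names zmask_loop and zmask_unrolled are hypothetical, and each dword is modeled as a plain uint32_t bit pattern rather than a real vector register, so this only illustrates the pseudocode and is not how any compiler or CPU implements it.

```c
#include <stdint.h>

/* Intrinsics Guide style: loop over the four lanes, testing imm8[j%8]. */
static void zmask_loop(const uint32_t tmp2[4], unsigned imm8, uint32_t dst[4])
{
    for (unsigned j = 0; j < 4; j++) {
        /* j is 0..3, so j % 8 == j and only imm8[3:0] is ever read. */
        unsigned bit = (imm8 >> (j % 8)) & 1u;
        dst[j] = bit ? 0u : tmp2[j];
    }
}

/* Instruction-manual style: ZMASK = imm8[3:0], one check per dword. */
static void zmask_unrolled(const uint32_t tmp2[4], unsigned imm8, uint32_t dst[4])
{
    unsigned zmask = imm8 & 0xFu;
    dst[0] = (zmask & 1u) ? 0u : tmp2[0];
    dst[1] = (zmask & 2u) ? 0u : tmp2[1];
    dst[2] = (zmask & 4u) ? 0u : tmp2[2];
    dst[3] = (zmask & 8u) ? 0u : tmp2[3];
}
```

Both functions should produce the same dst for every possible imm8, since the upper four bits never take part in the masking.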
(There's no intrinsic for insert directly from memory; you'd need an intrinsic for movss and then hope the compiler folds that load into a memory operand for insertps. With a memory source, imm8[7:6] is ignored and the scalar dword is simply taken as the element to insert (that's the ELSE COUNT_S ← 0 in the asm pseudocode), but then everything else works the same, including the zero-masking you're asking about.)
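The movss-then-insertps pattern mentioned above might look roughly like this with intrinsics. This is a sketch under assumptions: the helper name insert_from_memory and the choice of imm8 = 0x21 are made up for illustration, the target attribute is a GCC/Clang extension used here so the snippet builds without -msse4.1, and whether the load actually gets folded into a memory operand for insertps is up to the compiler.

```c
#include <immintrin.h>

/* Hypothetical helper: insert the float at *p into lane 2 of a vector
   and zero lane 0, using _mm_load_ss followed by _mm_insert_ps. */
static __attribute__((target("sse4.1")))
void insert_from_memory(const float *p, const float *vec_in, float *vec_out)
{
    __m128 v = _mm_loadu_ps(vec_in);   /* unaligned load of 4 floats */
    __m128 s = _mm_load_ss(p);         /* movss: {*p, 0, 0, 0} */

    /* imm8 = 0x21 = 0b00100001:
       imm8[7:6] = 0 -> take lane 0 of s
       imm8[5:4] = 2 -> write it to lane 2 of v
       imm8[3:0] = 1 -> zero lane 0 of the result */
    __m128 r = _mm_insert_ps(v, s, 0x21);

    _mm_storeu_ps(vec_out, r);
}
```

With input {1, 2, 3, 4} and *p = 9, the low four bits of the immediate zero lane 0 after the insert, so the stored result should be {0, 2, 9, 4}.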