How can I replace a missing VPERMIL2PS instruction, using equivalent instructions in AVX2?
VPERMIL2PS ymm1, ymm2, ymm3, ymm4/m256, imz2
Permute single-precision floating-point values in ymm2 and ymm3 using controls from ymm4/mem; the results are stored in ymm1 with selective zero-match controls.
Intel C/C++ Compiler Intrinsic Equivalent
VPERMIL2PS __m128 _mm_permute2_ps (__m128 a, __m128 b, __m128i ctrl, int imm)
VPERMIL2PS __m256 _mm256_permute2_ps (__m256 a, __m256 b, __m256i ctrl, int imm)
imz2: part of the is4 immediate byte, providing control functions that apply to two-source permute instructions.
The closest instruction is VPERMILPS, and that instruction still works:
VPERMILPS (256-bit immediate version)
DEST[31:0] ← Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] ← Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] ← Select4(SRC1[127:0], imm8[5:4]);
DEST[127:96] ← Select4(SRC1[127:0], imm8[7:6]);
DEST[159:128] ← Select4(SRC1[255:128], imm8[1:0]);
DEST[191:160] ← Select4(SRC1[255:128], imm8[3:2]);
DEST[223:192] ← Select4(SRC1[255:128], imm8[5:4]);
DEST[255:224] ← Select4(SRC1[255:128], imm8[7:6]);
VPERMILPS ymm1, ymm2, ymm3/m256 (RVM, AVX) Description - Permute single-precision floating-point values in ymm2 using controls from ymm3/mem and store the result in ymm1.
It's hard for me to say what the right approach is, because for reliability you would need to emulate the VPERMIL2PS instruction exactly, so I'm appealing to the local specialists!
Recent Intel(R) AVX Architectural Changes January 29, 2009 Removed: VPERMIL2PS and VPERMIL2PD
All PERMIL2 instructions are gone – both the 128-bit and 256-bit flavors. Like the FMA below, they used the VEX.W bit to select which source was from memory – we’re not moving in the direction of using VEX.W for that purpose any more.
The Intel compiler does not understand the VPERMIL2PS instruction. AVX-512 instructions require the latest processors, so they are not a general solution. Visual Studio assembles this instruction successfully, but it cannot be executed on the processor; it raises an exception.
Disassembled code
align 20h;
Yperm_msk ymmword 000000000100000006000000070000000C0000000D0000000A0000000B000000h
vmovups ymm0, [rbp+920h+var_8C0]
vmovdqu ymm1, Yperm_msk
vpermil2ps ymm0, ymm0, [rbp+920h+var_880], ymm1, 920h+var_920
vmovups [rbp+920h+var_1A0], ymm0
Full description of the instruction
Operation
select2sp(src1, src2, sel) // This macro is used by the “sel_and_condzerosp” macro below
{
if (sel[2:0]=0) then TMP ← src1[31:0]
if (sel[2:0]=1) then TMP ← src1[63:32]
if (sel[2:0]=2) then TMP ← src1[95:64]
if (sel[2:0]=3) then TMP ← src1[127:96]
if (sel[2:0]=4) then TMP ← src2[31:0]
if (sel[2:0]=5) then TMP ← src2[63:32]
if (sel[2:0]=6) then TMP ← src2[95:64]
if (sel[2:0]=7) then TMP ← src2[127:96]
return TMP
}
sel_and_condzerosp(src1, src2, sel) // This macro is used by VPERMIL2PS
{
TMP[31:0] ← select2sp(src1[127:0], src2[127:0], sel[2:0])
IF (imm8[1:0] = 2) AND (sel[3] = 1) THEN TMP[31:0] ← 0
IF (imm8[1:0] = 3) AND (sel[3] = 0) THEN TMP[31:0] ← 0
return TMP
}
VPERMIL2PS (VEX.256 encoded version)
DEST[31:0] ← sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[3:0])
DEST[63:32] ← sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[35:32])
DEST[95:64] ← sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[67:64])
DEST[127:96] ← sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[99:96])
DEST[159:128] ← sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[131:128])
DEST[191:160] ← sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[163:160])
DEST[223:192] ← sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[195:192])
DEST[255:224] ← sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[227:224])
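The pseudocode above maps directly to plain scalar C. This is a sketch of a reference model (the function names mirror the pseudocode macros; the array-based interface is my own), useful as ground truth when testing any SIMD replacement:

```c
#include <stdint.h>

/* select2sp: pick one of 8 dwords from the two 128-bit source lanes. */
static uint32_t select2sp(const uint32_t src1[4], const uint32_t src2[4],
                          uint32_t sel)
{
    sel &= 7;
    return (sel < 4) ? src1[sel] : src2[sel - 4];
}

/* sel_and_condzerosp: select, then conditionally zero based on sel bit 3
 * and the low 2 bits of the immediate (the "imz2" field). */
static uint32_t sel_and_condzerosp(const uint32_t src1[4],
                                   const uint32_t src2[4],
                                   uint32_t sel, unsigned imm2)
{
    uint32_t tmp = select2sp(src1, src2, sel);
    if (imm2 == 2 && (sel & 8))  tmp = 0;
    if (imm2 == 3 && !(sel & 8)) tmp = 0;
    return tmp;
}

/* VPERMIL2PS, VEX.256 form: two independent 128-bit lanes. */
static void vpermil2ps_256(uint32_t dst[8], const uint32_t src1[8],
                           const uint32_t src2[8], const uint32_t sel[8],
                           unsigned imm2)
{
    for (unsigned lane = 0; lane < 2; lane++)
        for (unsigned n = 0; n < 4; n++)
            dst[lane*4 + n] = sel_and_condzerosp(src1 + lane*4,
                                                 src2 + lane*4,
                                                 sel[lane*4 + n], imm2);
}
```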
This is how Bochs emulates the instruction:
class bxInstruction_c;
void BX_CPP_AttrRegparmN(1) BX_CPU_C::VPERMIL2PS_VdqHdqWdqIbR(bxInstruction_c *i)
{
BxPackedYmmRegister op1 = BX_READ_YMM_REG(i->src1());
BxPackedYmmRegister op2 = BX_READ_YMM_REG(i->src2());
BxPackedYmmRegister op3 = BX_READ_YMM_REG(i->src3()), result;
unsigned len = i->getVL();
result.clear();
for (unsigned n=0; n < len; n++) {
xmm_permil2ps(&result.ymm128(n), &op1.ymm128(n), &op2.ymm128(n), &op3.ymm128(n), i->Ib() & 3);
}
BX_WRITE_YMM_REGZ_VLEN(i->dst(), result, len);
BX_NEXT_INSTR(i);
}
BX_CPP_INLINE void xmm_permil2ps(BxPackedXmmRegister *r, const BxPackedXmmRegister *op1, const BxPackedXmmRegister *op2, const BxPackedXmmRegister *op3, unsigned m2z)
{
for(unsigned n=0; n < 4; n++) {
Bit32u ctrl = op3->xmm32u(n);
if ((m2z ^ ((ctrl >> 3) & 0x1)) == 0x3)
r->xmm32u(n) = 0;
else
r->xmm32u(n) = (ctrl & 0x4) ? op1->xmm32u(ctrl & 0x3) : op2->xmm32u(ctrl & 0x3);
}
}
They're not "gone" - they never existed in any real Intel CPU in the first place. 2009 is before the first CPU with AVX1 was released, while AVX was still in the planning stages. IDK what you were looking at that even mentioned them.
Current versions of the ISA ref manual, or HTML extracts of it don't mention it. Neither does Intel's intrinsics guide. Maybe a 10-year-old version of a "future extensions" manual from before Sandybridge was released?
because for reliability, you need to emulate the instruction VPERMIL2PS
No you don't, it never existed in the first place so there's no code that uses it. (Or very little; possibly some written in anticipation based on early pre-release AVX documentation). You only need to implement exactly the functionality that you need for any given problem.
You tagged this (AMD) XOP, but you only cited Intel documents; XOP did have some 2-input shuffles, I think (vpperm is a 2-source byte shuffle), but I didn't go check the docs. Of course, those are only ever for 128-bit vectors.
AVX1 does have some 2-input shuffles, but none with variable control. There's vshufps/vshufpd with immediate control, and vunpckl/hps (and ...pd) that do two separate in-lane versions of the corresponding 128-bit shuffle.

Worst case, you can build any fixed 2-input in-lane shuffle out of 2x vshufps + vblendps. Best case is one vshufps; in the middle is vshufps + vblendps or 2x vshufps (e.g. collect the elements you want into one vector, then put them in the right order). Any of those vshufps shuffles can be vunpcklps or vunpckhps. Keep in mind that immediate vblendps is cheap, but shuffles only have 1/clock throughput on Intel (port 5 only, until Ice Lake).
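As a made-up concrete example of that worst case: per 128-bit lane, producing {a1, b0, a3, b2} takes two vshufps (each with both sources the same register, so they act like vpermilps) plus one immediate vblendps. A sketch, assuming AVX (the function name and element choice are mine):

```c
#include <immintrin.h>

/* Per 128-bit lane, produce { a[1], b[0], a[3], b[2] } from a and b.
 * Two shuffles put the wanted elements of each source into position,
 * then an immediate blend (cheap: any port) merges them. */
__attribute__((target("avx")))
static void shuffle_a1_b0_a3_b2(const float a[8], const float b[8],
                                float out[8])
{
    __m256 va = _mm256_loadu_ps(a), vb = _mm256_loadu_ps(b);
    /* per lane { a1, a1, a3, a3 }: only slots 0 and 2 matter */
    __m256 ta = _mm256_shuffle_ps(va, va, _MM_SHUFFLE(3, 3, 1, 1));
    /* per lane { b0, b0, b2, b2 }: only slots 1 and 3 matter */
    __m256 tb = _mm256_shuffle_ps(vb, vb, _MM_SHUFFLE(2, 2, 0, 0));
    /* take tb in the odd slots */
    _mm256_storeu_ps(out, _mm256_blend_ps(ta, tb, 0xAA));
}
```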
You could even use variable-control 2x vpermilps and a compare or shift + vblendvps to emulate vpermil2ps, because vpermilps ignores high bits in the index. So this would be the Bochs implementation of (ctrl & 0x4) ? op1[ctrl & 0x3] : op2[ctrl & 0x3];, where you shuffle both inputs on ctrl with vpermilps (which implicitly only looks at the low 2 bits), and you blend on ctrl & 4 by shifting that bit to the top with an integer shift.

(Optionally also emulate the conditional zeroing with vandps, by using vpslld to put the 3rd index bit at the top for blend, and vpsrad or a compare-against-zero result to create an AND mask for vpand. Or on Skylake, vblendvps is 2 uops for any port, so you could just use that to blend in zeros instead of shift/and or cmp/and.)
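Putting that together as intrinsics, a sketch (assuming AVX2 for the 256-bit integer shifts; this implements only the imz2 = 2 case from the pseudocode above, zeroing where index bit 3 is set, and follows the spec convention that index bit 2 set selects src2; the function name is mine):

```c
#include <immintrin.h>

/* Emulate vpermil2ps dst, src1, src2, ctrl, 2:
 * vpermilps ignores high index bits, so shuffle both sources on ctrl,
 * blend on bit 2 moved to the sign bit, then AND away elements whose
 * index has bit 3 set. */
__attribute__((target("avx2")))
static void emu_vpermil2ps_z(const float s1[8], const float s2[8],
                             const int ctrl[8], float out[8])
{
    __m256i c  = _mm256_loadu_si256((const __m256i *)ctrl);
    __m256  p1 = _mm256_permutevar_ps(_mm256_loadu_ps(s1), c);
    __m256  p2 = _mm256_permutevar_ps(_mm256_loadu_ps(s2), c);
    /* index bit 2 -> bit 31: sign bit set means "take src2" */
    __m256  selv = _mm256_castsi256_ps(_mm256_slli_epi32(c, 29));
    __m256  r = _mm256_blendv_ps(p1, p2, selv);
    /* index bit 3 -> bit 31, arithmetic shift back: all-ones where set */
    __m256i z = _mm256_srai_epi32(_mm256_slli_epi32(c, 28), 31);
    r = _mm256_andnot_ps(_mm256_castsi256_ps(z), r);
    _mm256_storeu_ps(out, r);
}
```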
But don't just naively drop this in if you care about performance for a compile-time constant shuffle control. Instead build the equivalent shuffle out of the available 2-input operations. That's why I'm not bothering to write out a full implementation in C.
AVX2 only added a few new 2-input shuffles that might be useful here: 256-bit vpalignr, which is like 2 in-lane palignr instructions. It also added integer vpunpckl/h b/w/d/q, but we already have vunpckl/hps from AVX1.
A true variable-control 2-input shuffle didn't appear until AVX512F vpermt2ps and vpermi2ps/pd. But it doesn't support conditional zeroing based on high bits of index elements like pshufb or the proposed vpermil2ps; instead, use a mask register for zero masking. e.g.
vpmovd2m k1, ymm0 ; extract top bit of dword elements
knotw k1, k1 ; cleared for elements to be zeroed
vpermi2ps ymm0{k1}{z}, ymm0, ymm1, ymm2 ; ymm0=indices ymm1,ymm2 = table
; indices overwritten with result
; use vpermt2ps instead to overwrite one of the "table" inputs instead of the index vector.
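With intrinsics, a sketch of the same idea using vptestnmd (suggested below) for the mask, so it needs only AVX512F+VL; the helper name and array interface are mine:

```c
#include <immintrin.h>

/* mask = !signbit(index) via vptestnmd, then zero-masked vpermi2ps.
 * Indices 0-7 pick from a, 8-15 from b (lane-crossing, unlike
 * vpermil2ps); negative indices produce zero. */
__attribute__((target("avx512f,avx512vl")))
static void permi2ps_zmask(const float a[8], const float b[8],
                           const int idx[8], float out[8])
{
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    /* mask bit set where (idx & 0x80000000) == 0, i.e. non-negative */
    __mmask8 k = _mm256_testn_epi32_mask(vidx,
                                         _mm256_set1_epi32((int)0x80000000));
    __m256 r = _mm256_maskz_permutex2var_ps(k, _mm256_loadu_ps(a), vidx,
                                            _mm256_loadu_ps(b));
    _mm256_storeu_ps(out, r);
}
```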
Or probably better to use vfpclassps k1, ymm0, some_constant to get k1 set for non-negative values, avoiding the need for a knot. On Skylake-X it's a single uop.

Or use vptestnmd with a set1(1UL<<31) mask to set a mask register = !signbit of a vector.
It's also not "in lane", so you'd potentially need to tweak the indices, adding 8 for indices >= 4, I think (plus a lane base for the high lane). vpermi/t2ps indexes into the concatenation of the two vectors, so cross-lane within one source happens before selecting the other input.
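That index rewrite can be sketched in scalar C (my own helper, not from any manual): per lane, keep the low 2 bits of the vpermil2ps-style selector, add the lane base, and add 8 when the source-select bit (bit 2) is set:

```c
#include <stdint.h>

/* Convert one lane-relative vpermil2ps-style selector (bits [2:0]) into
 * a ymm vpermi2ps index (0-15 into the concatenation src2:src1).
 * 'lane' is 0 for elements 0-3, 1 for elements 4-7. */
static int permil2_to_permi2_index(int sel, int lane)
{
    int elem = sel & 3;             /* element within the 128-bit lane */
    int from_src2 = (sel >> 2) & 1; /* bit 2: 0 = src1, 1 = src2       */
    return elem + 4 * lane + 8 * from_src2;
}
```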