How can I replace a missing VPERMIL2PS instruction, using equivalent instructions in AVX2?
VPERMIL2PS ymm1, ymm2, ymm3, ymm4/m256, imz2
Permute single-precision floating-point values in ymm2 and ymm3 using controls from ymm4/mem; the results are stored in ymm1 with selective zero-match controls.
Intel C/C++ Compiler Intrinsic Equivalent
VPERMIL2PS __m128 _mm_permute2_ps (__m128 a, __m128 b, __m128i ctrl, int imm)
VPERMIL2PS __m256 _mm256_permute2_ps (__m256 a, __m256 b, __m256i ctrl, int imm)
imz2: part of the is4 immediate byte, providing control functions that apply to two-source permute instructions.
The closest instruction is VPERMILPS, and that instruction still works:
VPERMILPS (256-bit immediate version)
DEST[31:0] ← Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] ← Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] ← Select4(SRC1[127:0], imm8[5:4]);
DEST[127:96] ← Select4(SRC1[127:0], imm8[7:6]);
DEST[159:128] ← Select4(SRC1[255:128], imm8[1:0]);
DEST[191:160] ← Select4(SRC1[255:128], imm8[3:2]);
DEST[223:192] ← Select4(SRC1[255:128], imm8[5:4]);
DEST[255:224] ← Select4(SRC1[255:128], imm8[7:6]);
VPERMILPS ymm1, ymm2, ymm3/m256 (RVM, AVX) Description - Permute single-precision floating-point values in ymm2 using controls from ymm3/mem and store the result in ymm1.
It's hard for me to say what the right approach is, because for reliability you would need to emulate the VPERMIL2PS instruction exactly, so I'm appealing to the local specialists!
Recent Intel(R) AVX Architectural Changes January 29, 2009 Removed: VPERMIL2PS and VPERMIL2PD
All PERMIL2 instructions are gone – both the 128-bit and 256-bit flavors. Like the FMA below, they used the VEX.W bit to select which source was from memory – we’re not moving in the direction of using VEX.W for that purpose any more.
The Intel compiler does not understand the VPERMIL2PS instruction. AVX-512 instructions require the latest processors, so they are not a general solution. Visual Studio assembles this instruction successfully, but it cannot be executed on the processor; it raises an exception.
Disassembled code
align 20h;
Yperm_msk ymmword 000000000100000006000000070000000C0000000D0000000A0000000B000000h
vmovups ymm0, [rbp+920h+var_8C0]
vmovdqu ymm1, Yperm_msk
vpermil2ps ymm0, ymm0, [rbp+920h+var_880], ymm1, 920h+var_920
vmovups [rbp+920h+var_1A0], ymm0
Full description of the instruction
Operation
select2sp(src1, src2, sel) // This macro is used by the “sel_and_condzerosp” macro below
{
if (sel[2:0]=0) then TMP ← src1[31:0]
if (sel[2:0]=1) then TMP ← src1[63:32]
if (sel[2:0]=2) then TMP ← src1[95:64]
if (sel[2:0]=3) then TMP ← src1[127:96]
if (sel[2:0]=4) then TMP ← src2[31:0]
if (sel[2:0]=5) then TMP ← src2[63:32]
if (sel[2:0]=6) then TMP ← src2[95:64]
if (sel[2:0]=7) then TMP ← src2[127:96]
return TMP
}
sel_and_condzerosp(src1, src2, sel) // This macro is used by VPERMIL2PS
{
TMP[31:0] ← select2sp(src1[127:0], src2[127:0], sel[2:0])
IF (imm8[1:0] = 2) AND (sel[3] = 1) THEN TMP[31:0] ← 0
IF (imm8[1:0] = 3) AND (sel[3] = 0) THEN TMP[31:0] ← 0
return TMP
}
VPERMIL2PS (VEX.256 encoded version)
DEST[31:0] ← sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[3:0])
DEST[63:32] ← sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[35:32])
DEST[95:64] ← sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[67:64])
DEST[127:96] ← sel_and_condzerosp(SRC1[127:0], SRC2[127:0], SRC3[99:96])
DEST[159:128] ← sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[131:128])
DEST[191:160] ← sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[163:160])
DEST[223:192] ← sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[195:192])
DEST[255:224] ← sel_and_condzerosp(SRC1[255:128], SRC2[255:128], SRC3[227:224])
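The pseudocode above maps directly to plain scalar C. This is a sketch of a reference model (the function names mirror the pseudocode macros; the array-based interface is my own), useful as ground truth when testing any SIMD replacement:

```c
#include <stdint.h>

/* select2sp: pick one of 8 dwords from the two 128-bit source lanes. */
static uint32_t select2sp(const uint32_t src1[4], const uint32_t src2[4],
                          uint32_t sel)
{
    sel &= 7;
    return (sel < 4) ? src1[sel] : src2[sel - 4];
}

/* sel_and_condzerosp: select, then conditionally zero based on sel bit 3
 * and the low 2 bits of the immediate (the "imz2" field). */
static uint32_t sel_and_condzerosp(const uint32_t src1[4],
                                   const uint32_t src2[4],
                                   uint32_t sel, unsigned imm2)
{
    uint32_t tmp = select2sp(src1, src2, sel);
    if (imm2 == 2 && (sel & 8))  tmp = 0;
    if (imm2 == 3 && !(sel & 8)) tmp = 0;
    return tmp;
}

/* VPERMIL2PS, VEX.256 form: two independent 128-bit lanes. */
static void vpermil2ps_256(uint32_t dst[8], const uint32_t src1[8],
                           const uint32_t src2[8], const uint32_t sel[8],
                           unsigned imm2)
{
    for (unsigned lane = 0; lane < 2; lane++)
        for (unsigned n = 0; n < 4; n++)
            dst[lane*4 + n] = sel_and_condzerosp(src1 + lane*4,
                                                 src2 + lane*4,
                                                 sel[lane*4 + n], imm2);
}
```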
This is how Bochs emulates the instruction:
class bxInstruction_c;
void BX_CPP_AttrRegparmN(1) BX_CPU_C::VPERMIL2PS_VdqHdqWdqIbR(bxInstruction_c *i)
{
BxPackedYmmRegister op1 = BX_READ_YMM_REG(i->src1());
BxPackedYmmRegister op2 = BX_READ_YMM_REG(i->src2());
BxPackedYmmRegister op3 = BX_READ_YMM_REG(i->src3()), result;
unsigned len = i->getVL();
result.clear();
for (unsigned n=0; n < len; n++) {
xmm_permil2ps(&result.ymm128(n), &op1.ymm128(n), &op2.ymm128(n), &op3.ymm128(n), i->Ib() & 3);
}
BX_WRITE_YMM_REGZ_VLEN(i->dst(), result, len);
BX_NEXT_INSTR(i);
}
BX_CPP_INLINE void xmm_permil2ps(BxPackedXmmRegister *r, const BxPackedXmmRegister *op1, const BxPackedXmmRegister *op2, const BxPackedXmmRegister *op3, unsigned m2z)
{
for(unsigned n=0; n < 4; n++) {
Bit32u ctrl = op3->xmm32u(n);
if ((m2z ^ ((ctrl >> 3) & 0x1)) == 0x3)
r->xmm32u(n) = 0;
else
r->xmm32u(n) = (ctrl & 0x4) ? op1->xmm32u(ctrl & 0x3) : op2->xmm32u(ctrl & 0x3);
}
}
They're not "gone" - they never existed in any real Intel CPU in the first place. 2009 is before the first CPU with AVX1 was released, while AVX was still in the planning stages. IDK what you were looking at that even mentioned them.
Current versions of the ISA ref manual, or HTML extracts of it don't mention it. Neither does Intel's intrinsics guide. Maybe a 10-year-old version of a "future extensions" manual from before Sandybridge was released?
because for reliability, you need to emulate the instruction VPERMIL2PS
No you don't, it never existed in the first place so there's no code that uses it. (Or very little; possibly some written in anticipation based on early pre-release AVX documentation). You only need to implement exactly the functionality that you need for any given problem.
You tagged this (AMD) XOP, but you only cited Intel documents; XOP did have some 2-input shuffles, I think (vpperm is a 2-source byte shuffle), but I didn't go check the docs. Of course, those are only ever for 128-bit vectors.
AVX1 does have some 2-input shuffles, but none with variable control. There's vshufps/vshufpd with immediate control, and vunpckl/hps (and ...pd) that do two separate in-lane versions of the corresponding 128-bit shuffle.

Worst case, you can build any fixed 2-input in-lane shuffle out of 2x vshufps + vblendps. Best case is one vshufps; in the middle is vshufps + vblendps or 2x vshufps (e.g. collect the elements you want into one vector, then put them in the right order). Any of those vshufps shuffles can be vunpcklps or vunpckhps. Keep in mind that immediate vblendps is cheap, but shuffles only have 1/clock throughput on Intel (port 5 only, until Ice Lake).
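As a made-up concrete example of that worst case: per 128-bit lane, producing {a1, b0, a3, b2} takes two vshufps (each with both sources the same register, so they act like vpermilps) plus one immediate vblendps. A sketch, assuming AVX (the function name and element choice are mine):

```c
#include <immintrin.h>

/* Per 128-bit lane, produce { a[1], b[0], a[3], b[2] } from a and b.
 * Two shuffles put the wanted elements of each source into position,
 * then an immediate blend (cheap: any port) merges them. */
__attribute__((target("avx")))
static void shuffle_a1_b0_a3_b2(const float a[8], const float b[8],
                                float out[8])
{
    __m256 va = _mm256_loadu_ps(a), vb = _mm256_loadu_ps(b);
    /* per lane { a1, a1, a3, a3 }: only slots 0 and 2 matter */
    __m256 ta = _mm256_shuffle_ps(va, va, _MM_SHUFFLE(3, 3, 1, 1));
    /* per lane { b0, b0, b2, b2 }: only slots 1 and 3 matter */
    __m256 tb = _mm256_shuffle_ps(vb, vb, _MM_SHUFFLE(2, 2, 0, 0));
    /* take tb in the odd slots */
    _mm256_storeu_ps(out, _mm256_blend_ps(ta, tb, 0xAA));
}
```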
You could even use variable-control 2x vpermilps and a compare or shift + vblendvps to emulate vpermil2ps, because vpermilps ignores high bits in the index. So this would be the Bochs implementation of (ctrl & 0x4) ? op1[ctrl & 0x3] : op2[ctrl & 0x3];, where you shuffle both inputs on ctrl with vpermilps (which implicitly only looks at the low 2 bits), and you blend on ctrl & 4 by shifting that bit to the top with an integer shift.

(Optionally also emulate the conditional zeroing with vandps, by using vpslld to put the 3rd index bit at the top for blend, and vpsrad or a compare-against-zero result to create an AND mask for vpand. Or on Skylake, vblendvps is 2 uops for any port, so you could just use that to blend in zeros instead of shift/and or cmp/and.)
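Putting that together as intrinsics, a sketch (assuming AVX2 for the 256-bit integer shifts; this implements only the imz2 = 2 case from the pseudocode above, zeroing where index bit 3 is set, and follows the spec convention that index bit 2 set selects src2; the function name is mine):

```c
#include <immintrin.h>

/* Emulate vpermil2ps dst, src1, src2, ctrl, 2:
 * vpermilps ignores high index bits, so shuffle both sources on ctrl,
 * blend on bit 2 moved to the sign bit, then AND away elements whose
 * index has bit 3 set. */
__attribute__((target("avx2")))
static void emu_vpermil2ps_z(const float s1[8], const float s2[8],
                             const int ctrl[8], float out[8])
{
    __m256i c  = _mm256_loadu_si256((const __m256i *)ctrl);
    __m256  p1 = _mm256_permutevar_ps(_mm256_loadu_ps(s1), c);
    __m256  p2 = _mm256_permutevar_ps(_mm256_loadu_ps(s2), c);
    /* index bit 2 -> bit 31: sign bit set means "take src2" */
    __m256  selv = _mm256_castsi256_ps(_mm256_slli_epi32(c, 29));
    __m256  r = _mm256_blendv_ps(p1, p2, selv);
    /* index bit 3 -> bit 31, arithmetic shift back: all-ones where set */
    __m256i z = _mm256_srai_epi32(_mm256_slli_epi32(c, 28), 31);
    r = _mm256_andnot_ps(_mm256_castsi256_ps(z), r);
    _mm256_storeu_ps(out, r);
}
```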
But don't just naively drop this in if you care about performance for a compile-time constant shuffle control. Instead build the equivalent shuffle out of the available 2-input operations. That's why I'm not bothering to write out a full implementation in C.
AVX2 only added a few new 2-input shuffles that might be useful here: 256-bit vpalignr, which is like 2 in-lane palignr instructions. It also added integer vpunpckl/h b/w/d/q, but we already have vunpckl/hps from AVX1.
A true variable-control 2-input shuffle didn't appear until AVX512F vpermt2ps and vpermi2ps/pd. But it doesn't support conditional zeroing based on high bits of index elements like pshufb or the proposed vpermil2ps; instead, use a mask register for zero masking. e.g.
vpmovd2m k1, ymm0 ; extract top bit of dword elements
knotw k1, k1 ; cleared for elements to be zeroed
vpermi2ps ymm0{k1}{z}, ymm0, ymm1, ymm2 ; ymm0=indices ymm1,ymm2 = table
; indices overwritten with result
; use vpermt2ps instead to overwrite one of the "table" inputs instead of the index vector.
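With intrinsics, a sketch of the same idea using vptestnmd (suggested below) for the mask, so it needs only AVX512F+VL; the helper name and array interface are mine:

```c
#include <immintrin.h>

/* mask = !signbit(index) via vptestnmd, then zero-masked vpermi2ps.
 * Indices 0-7 pick from a, 8-15 from b (lane-crossing, unlike
 * vpermil2ps); negative indices produce zero. */
__attribute__((target("avx512f,avx512vl")))
static void permi2ps_zmask(const float a[8], const float b[8],
                           const int idx[8], float out[8])
{
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    /* mask bit set where (idx & 0x80000000) == 0, i.e. non-negative */
    __mmask8 k = _mm256_testn_epi32_mask(vidx,
                                         _mm256_set1_epi32((int)0x80000000));
    __m256 r = _mm256_maskz_permutex2var_ps(k, _mm256_loadu_ps(a), vidx,
                                            _mm256_loadu_ps(b));
    _mm256_storeu_ps(out, r);
}
```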
Or probably better to use vfpclassps k1, ymm0, some_constant to get k1 set for non-negative values, avoiding the need for a knot. On Skylake-X it's a single uop.

Or use vptestnmd with a set1(1UL<<31) mask to set a mask register = !signbit of a vector.
It's also not "in lane", so you'd potentially need to tweak the indices, adding 8 for indices >= 4, I think (plus a lane base for the high lane). vpermi/t2ps indexes into the concatenation of the two vectors, so cross-lane within one source happens before selecting the other input.
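That index rewrite can be sketched in scalar C (my own helper, not from any manual): per lane, keep the low 2 bits of the vpermil2ps-style selector, add the lane base, and add 8 when the source-select bit (bit 2) is set:

```c
#include <stdint.h>

/* Convert one lane-relative vpermil2ps-style selector (bits [2:0]) into
 * a ymm vpermi2ps index (0-15 into the concatenation src2:src1).
 * 'lane' is 0 for elements 0-3, 1 for elements 4-7. */
static int permil2_to_permi2_index(int sel, int lane)
{
    int elem = sel & 3;             /* element within the 128-bit lane */
    int from_src2 = (sel >> 2) & 1; /* bit 2: 0 = src1, 1 = src2       */
    return elem + 4 * lane + 8 * from_src2;
}
```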