I was referring to Intel's manual on the Xeon Phi instruction set and wasn't able to understand how the scatter/gather instructions work.
Suppose if I have the following vector of doubles:
A-> |b4|a4|b3|a3|b2|a2|b1|a1|
Is it possible to create 4 vectors as follows:
V1->|b1|a1|b1|a1|b1|a1|b1|a1|
V2->|b2|a2|b2|a2|b2|a2|b2|a2|
V3->|b3|a3|b3|a3|b3|a3|b3|a3|
V4->|b4|a4|b4|a4|b4|a4|b4|a4|
using these instructions? Is there any other way to achieve this?
Got this from the Intel Forums (answered by Evgueni Petrov):
__m512d V1 = (__m512d)_mm512_extload_epi32(&Addr, _MM_UPCONV_EPI32_NONE, _MM_BROADCAST_4X16, _MM_HINT_NONE);
where 'Addr' is the address of the location in memory, from which we loaded the doubles into vector 'A'.
We can do a similar operation for V2,V3,V4, by using &(Addr+2), &(Addr+4) and &(Addr+6) respectively.