So I have a set of data with mixed values for packing purposes that goes like this:
{(Point_x, Point_y, Point_z, Scalar),
(Point_x, Point_y, Point_z, Scalar),
(Point_x, Point_y, Point_z, Scalar),
...}
where each Point_x, Point_y, Point_z, and Scalar is a 32-bit float.
Because of that I can load each point with an aligned load, but I need to move the point's x, y, z into its own register for my operation and then set the last value in the __m128 register (where the scalar would be) to 1.f. What instruction sets the last value in the register to 1.f and leaves the other values untouched?
Currently I am doing:
__m128 rPointMixed = _mm_load_ps( (float*)pPoint );
__m128 rOne =_mm_set1_ps(1.f);
__m128 rPoint = _mm_blend_ps(rPointMixed,rOne,0x8);
but this might not be the most efficient solution. I am fine with SSE4/AVX/AVX2 instructions if there is a really efficient way to do it with them.
struct Vec4f
{
    float x;
    float y;
    float z;
    float scalar;
};
Vec4f vData[10000];
// in reality this loop is unrolled to do 8 at a time, but it is rolled up here for simplicity's sake
for( int i = 0; i < 10000; ++i )
{
    __m128 rPointData = _mm_load_ps( (float*)&vData[i] );
    // math here where it permutes the scalar and does math with it
    // 3 just because it is the index of the last value?
    __m128 rPoint = [unknownIntrinsic](rPointData, 1.f, 3); //?
    // point math here
}
Thanks in advance for the help.
Unless there's an instruction with better throughput than blendps (0.33 CPI on Intel), what you're doing is already ideal.
Note that you don't actually need to call _mm_set1_ps(1.f) on every iteration of the loop, as rOne is constant (so your "unknown intrinsic" is actually just _mm_blend_ps). With optimizations enabled, however, most compilers will be smart enough to do that just once before the loop anyway.
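To make the hoisting explicit, here is a minimal sketch of the loop with rOne built once outside it. The helper name pointsWithUnitW, the SSE41_TARGET macro, the separate output array, and the alignas(16) on Vec4f are assumptions for illustration, not part of the question's code. The alignment is added because _mm_load_ps requires a 16-byte-aligned address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <smmintrin.h>  // SSE4.1, provides _mm_blend_ps

// Assumption: on GCC/Clang, enable SSE4.1 for this one function so the
// sketch builds without -msse4.1; other compilers are left untouched.
#if defined(__GNUC__) || defined(__clang__)
#define SSE41_TARGET __attribute__((target("sse4.1")))
#else
#define SSE41_TARGET
#endif

// alignas(16) is an addition here: _mm_load_ps needs 16-byte alignment.
struct alignas(16) Vec4f
{
    float x;
    float y;
    float z;
    float scalar;
};

// Hypothetical helper: writes (x, y, z, 1.f) for each input point to out.
SSE41_TARGET
void pointsWithUnitW(const Vec4f* in, Vec4f* out, std::size_t count)
{
    assert(reinterpret_cast<std::uintptr_t>(in) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);

    // Constant vector, created once before the loop.
    const __m128 rOne = _mm_set1_ps(1.f);

    for (std::size_t i = 0; i < count; ++i)
    {
        const __m128 rPointMixed =
            _mm_load_ps(reinterpret_cast<const float*>(&in[i]));

        // Mask 0x8 (binary 1000): lane 3 comes from rOne,
        // lanes 0..2 come from rPointMixed.
        const __m128 rPoint = _mm_blend_ps(rPointMixed, rOne, 0x8);

        _mm_store_ps(reinterpret_cast<float*>(&out[i]), rPoint);
    }
}
```

The store is only there so the result can be inspected; in the real loop rPoint would feed straight into the point math.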