simd altivec

Avoiding invalid memory load with SIMD instructions


I am loading elements from memory using SIMD load instructions, let's say using AltiVec, assuming aligned addresses:

float X[SIZE];
vector float V0;
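unsigned FLOAT_VEC_SIZE = sizeof(vector float) / sizeof(float); /* elements per vector */
for (int load_index = 0; load_index < SIZE; load_index += FLOAT_VEC_SIZE)
{
    V0 = vec_ld(0, &X[load_index]); /* vec_ld takes a byte offset, so index via the pointer */
    /* some computation involving V0 */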
}

Now if SIZE is not a multiple of FLOAT_VEC_SIZE, V0 may contain invalid memory elements in the last loop iteration. One way to avoid that is to reduce the loop by one iteration; another is to mask off the potentially invalid elements. Is there any other useful trick here? Note that the above is the innermost of a set of nested loops, so any additional non-SIMD instruction will come with a performance penalty!


Solution

  • Ideally you should pad your array to a multiple of vec_step(vector float) elements (i.e. a multiple of 4 floats), and then either mask out the unwanted values in the final vector or use scalar code to deal with the last few elements, e.g.

    const int VF_ELEMS = vec_step(vector float);           // floats per vector (4)
    const int VEC_SIZE = (SIZE + VF_ELEMS - 1) / VF_ELEMS; // number of vectors in X, rounded up
    vector float VX[VEC_SIZE];   // padded array with 16 byte alignment
    float *X = (float *)VX;      // float * pointer to base of array
    vector float V0;
    int i;                       // declared outside the loop so it is still valid afterwards
    
    for (i = 0; i <= SIZE - VF_ELEMS; i += VF_ELEMS)
    {                            // for each full SIMD vector
        V0 = vec_ld(0, &X[i]);
        /* some computation involving V0 */
    }
    if (i < SIZE)                // if we have a partial vector at the end
    {
    #if 1                        // either use SIMD and mask out the unwanted values
        V0 = vec_ld(0, &X[i]);   // safe: the padding makes the whole vector readable
        /* some SIMD computation involving partial V0 */
    #else                        // or use a scalar loop for the remaining 1..3 elements
        /* small scalar loop to handle remaining points */
    #endif
    }
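
    The "mask out the unwanted values" branch is left open above, so here is a rough sketch of the idea in plain C rather than AltiVec intrinsics, so that it compiles anywhere. The helper names (padded_len, tail_mask, masked_sum) and the fixed lane count of 4 are assumptions made for illustration, and the sum is only a stand-in for "some computation". On AltiVec the per-lane mask would typically be applied with something like vec_sel instead of the inner scalar loops.

```c
#include <stddef.h>

#define VF_ELEMS 4 /* assumed lane count, i.e. vec_step(vector float) */

/* Round n up to a whole number of vectors (in elements). */
static size_t padded_len(size_t n)
{
    return (n + VF_ELEMS - 1) / VF_ELEMS * VF_ELEMS;
}

/* For the vector starting at element i, set mask[lane] to all ones
 * if that lane holds real data (i + lane < n) and to zero if it is padding. */
static void tail_mask(size_t i, size_t n, unsigned int mask[VF_ELEMS])
{
    for (size_t lane = 0; lane < VF_ELEMS; ++lane)
        mask[lane] = (i + lane < n) ? 0xFFFFFFFFu : 0u;
}

/* Sum the first n floats of x, four lanes at a time.
 * x must have padded_len(n) readable elements; padding values are ignored. */
static float masked_sum(const float *x, size_t n)
{
    float acc[VF_ELEMS] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    /* Full vectors: every lane is valid, no mask needed. */
    for (; i + VF_ELEMS <= n; i += VF_ELEMS)
        for (size_t lane = 0; lane < VF_ELEMS; ++lane)
            acc[lane] += x[i + lane];

    /* Partial last vector: load all lanes, zero the padding lanes. */
    if (i < n) {
        unsigned int mask[VF_ELEMS];
        tail_mask(i, n, mask);
        for (size_t lane = 0; lane < VF_ELEMS; ++lane)
            acc[lane] += mask[lane] ? x[i + lane] : 0.0f;
    }

    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

    The mask only costs anything in the final iteration, so the inner loop over full vectors stays free of extra instructions. For a sum the padding could equally be zero-filled up front, which avoids the mask altogether; the mask is useful when a neutral padding value is not available.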