Tags: c, optimization, vectorization, premature-optimization

Efficiently taking the absolute value of an integer vector in C


The task is to set each element of a C integer array to its absolute value. I'm trying to do it as efficiently as possible. Below is a progression of optimizations that I've made. Please tell me whether these are actually optimizations at all, and whether any more can be made!

The first parameter of the function will be an integer array, and the second will be the number of elements in that array.

Here's the standard implementation:

void absolute (int array[], int n){
  for(int i = 0; i < n; i++)
    if(array[i] < 0)
      array[i] = - array[i];
}

That's plenty to satisfy any introductory programming course professor, but I'd like to play with it a little more and maybe learn something about optimization along the way.

Based on https://stackoverflow.com/a/2074403, a branchless absolute value:

void absolute (int array[], int n){
  for(int i = 0; i < n; i++){
    uint32_t temp = array[i] >> 31;     // make a mask of the sign bit
    array[i] ^= temp;                   // toggle the bits if value is negative
    array[i] += temp & 1;               // add one if value was negative
  }
}

Based on the idea that comparisons against zero are cheaper, and that I don't need an extra index variable sitting around:

void absolute (int array[], int n){
  for(n--; n >= 0;){
    uint32_t temp = array[n] >> 31;
    array[n] ^= temp;
    array[n] += temp & 1;
  }
}

(does this still vectorize, though?)

That's as far as I got. Can more be done to optimize this function?


Solution

  • Personally, I rather like this question. It's questions like these that get me wondering whether there is a way to make my own code better.

    Your final optimization is incorrect: it initializes with n--, but n is never decremented again, so the loop never terminates. To correct this you need for(n--; n >= 0; n--). That said, in my tests I found no noticeable advantage to decrementing versus incrementing the loop counter.
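
    For reference, here's the corrected loop in full (a minimal sketch; uint32_t requires <stdint.h>):

    #include <stdint.h>

    void absolute(int array[], int n)
    {
            for(n--; n >= 0; n--)
            {
                    uint32_t temp = array[n] >> 31;     // all ones if negative, all zeros otherwise
                    array[n] ^= temp;                   // flip the bits if the value is negative
                    array[n] += temp & 1;               // add one if the value was negative
            }
    }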

    If the values of the array are not truly randomly distributed, I found that the simple if(array[i] < 0) test used in the first implementation is actually significantly faster, because the branch predictor gets it right almost every time.

    Here's the code I used to benchmark:

    #include <stdio.h>
    #include <time.h>
    #include <stdlib.h>
    #include <stdint.h>
    #ifdef _OPT3
    #include <emmintrin.h>
    #include <tmmintrin.h>
    #endif
    
    int main(int argc, char **argv)
    {
            int *array;
            struct timespec tsstart, tsend;
            int ncount = 500000000;
            int i;
    
            array = malloc(sizeof(int) * ncount);
    
            for(i = 0; i < ncount; i++)
            {
                    array[i] = rand();
    #ifdef _DIST
                    if(rand() % 100 == 0) // make the values less likely to be negative.
    #else
                    if(rand() % 2 == 0) // the values are equally likely to be negative as positive.
    #endif
                            array[i] = -rand();
            }
    
            clock_gettime(CLOCK_MONOTONIC, &tsstart);
    
    #ifdef _OPT1
            for(i = 0; i < ncount; i++)
            {
                    uint32_t ntemp = array[i] >> 31;
                    array[i] ^= ntemp;
                    array[i] += ntemp & 1;
            }
    #elif _OPT2
            for(ncount--; ncount >= 0; ncount--)
            {
                    uint32_t ntemp = array[ncount] >> 31;
                    array[ncount] ^= ntemp;
                    array[ncount] += ntemp & 1;
            }
    #elif _OPT3
            for(i = 0; i < ncount; i+=4)
            {
                    __m128i a3_a2_a1_a0 = _mm_loadu_si128((__m128i*)&array[i]);         //Load 4 int32 elements from array.
                    a3_a2_a1_a0 = _mm_abs_epi32(a3_a2_a1_a0);                           //Set absolute of 4 int32 elements in single instruction.
                    _mm_storeu_si128((__m128i*)(&array[i]), a3_a2_a1_a0);               //Store 4 int32 elements of array.
            }
    #elif _OPT4
            for(i = 0; i < ncount; i++)
            {
                    array[i] = abs(array[i]); // abs() is actually an intrinsic on gcc and msvc
            }
    #else
            for(i = 0; i < ncount; i++)
            {
                    if(array[i] < 0)
                    {
                            array[i] = -array[i];
                    }
            }
    #endif
    
            clock_gettime(CLOCK_MONOTONIC, &tsend);
    
            printf("start: %ld.%09ld\n", tsstart.tv_sec, tsstart.tv_nsec);
            printf("end: %ld.%09ld\n", tsend.tv_sec, tsend.tv_nsec);
    
            tsend.tv_sec -= tsstart.tv_sec;
            tsend.tv_nsec -= tsstart.tv_nsec;
            if(tsend.tv_nsec < 0)
            {
                    tsend.tv_sec--;
                    tsend.tv_nsec = 1000000000 + tsend.tv_nsec;
            }
            printf("diff: %ld.%09ld\n", tsend.tv_sec, tsend.tv_nsec);
    
            free(array);
    
            return 0;
    }
    

    Test Results

    Here are my results (times are in seconds). These tests were run on an Intel(R) Xeon(R) CPU W3580 @ 3.33GHz with gcc (Debian 4.9.2-10) 4.9.2.

    // Implementation One (No Optimizations)
    $ gcc -O3 -march=native test.c
    $ ./a.out
    start: 9221396.418007954
    end: 9221398.103490309
    diff: 1.685482355
    
    // Implementation One, Non-Random Distribution
    $ gcc -D_DIST -O3 -march=native test.c
    $ ./a.out
    start: 9221515.889463124
    end: 9221516.255742919
    diff: 0.366279795
    
    // Implementation Two (Branchless)
    $ gcc -D_OPT1 -O3 -march=native test.c
    $ ./a.out
    start: 9221472.539690988
    end: 9221472.787347636
    diff: 0.247656648
    
    // Implementation Three (Branchless Decrement)
    $ gcc -D_OPT2 -O3 -march=native test.c
    $ ./a.out
    start: 9221930.068693139
    end: 9221930.334575475
    diff: 0.265882336
    
    // Rotem's Implementation (SIMD)
    $ gcc -D_OPT3 -O3 -march=native test.c
    $ ./a.out
    start: 9222076.001094679
    end: 9222076.230432423
    diff: 0.229337744
    
    // Intuitive abs() Implementation
    $ gcc -D_OPT4 -O3 -march=native test.c
    $ ./a.out
    start: 9222112.523690484
    end: 9222112.754820240
    diff: 0.231129756
    // Intuitive abs() Implementation Without -march=native
    $ gcc -D_OPT4 -O3 test.c
    $ ./a.out
    start: 9223301.744006196
    end: 9223301.974097927
    diff: 0.230091731
    

    Conclusion

    My takeaway from this is that the hardware's branch prediction can speed up execution more than any of these software-based micro-optimizations. By optimizing out the branching, you've created code that executes the same steps no matter what data it processes. So while it runs in constant time, if the data is not perfectly randomly distributed you might actually be making execution slower.

    Update: I did some tests with compiler optimizations turned on and found different results that don't entirely support the conclusions I came to earlier.

    In my experience, I have found that simply writing less code is normally the best optimization. The fewer the instructions, the faster the execution, regardless of hardware features.

    I look forward to reading any comments on this exercise.

    Update

    I've added the results of Rotem's implementation. This code is super fast and reinforces the point that the fewer instructions you have, the faster the execution time. Nice work, Rotem!

    Update 2

    I did some extensive testing today and found that micro-optimizations like changing the way the for loop counts have absolutely no effect when compiler optimizations like gcc -O3 are turned on. The compiler ends up generating assembly that compares a pointer into the array against its end to test whether we've finished.
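
    To illustrate (my own sketch of the idea, not the compiler's actual output), the generated loop behaves roughly like this pointer-based version:

    void absolute(int *array, int n)
    {
            int *p = array;
            int *end = array + n;       // the compiler tests against this end pointer
            while(p != end)
            {
                    if(*p < 0)
                            *p = -*p;
                    p++;
            }
    }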

    Optimizing the SSE code provided by Rotem also makes no difference when compiled with gcc -O3, as the memory is properly aligned on a 16-byte boundary, which removes the need for the unaligned _mm_loadu_si128()/_mm_storeu_si128() variants.
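
    For illustration, the aligned variant would look something like this (a sketch assuming a 16-byte-aligned buffer, e.g. from C11 aligned_alloc(), that n is a multiple of 4, and compilation with -march=native as in the benchmark):

    #include <tmmintrin.h>

    void absolute_sse(int *array, int n)
    {
            for(int i = 0; i < n; i += 4)
            {
                    __m128i v = _mm_load_si128((__m128i*)&array[i]);    // aligned load of 4 int32
                    v = _mm_abs_epi32(v);                               // PABSD: |x| on 4 lanes at once
                    _mm_store_si128((__m128i*)&array[i], v);            // aligned store back
            }
    }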

    Final Update

    I added another implementation that uses the simple and intuitive abs() function. It turns out abs() on gcc and MSVC is actually a compiler intrinsic. I redid all the test results using just gcc -O3 optimizations.

    As you can see, Rotem's SIMD implementation and the abs() implementation are the fastest, followed by the two XOR implementations, and finally the branching implementations.

    Of the two XOR implementations, the one that decrements the for loop is actually slightly slower, as its loop contains 14 instructions whereas the increment loop contains only 13.

    Rotem's SIMD implementation and the abs() implementation actually both rely on the PABSD instruction, and both have loops containing 7 instructions. The slight difference in speed (SIMD being slightly faster) comes from the fact that the optimized SIMD implementation assumes the array always holds a multiple of 4 integers (128 bits), whereas the abs() implementation requires extra instructions to handle arrays whose length is not a multiple of 4.
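
    A typical way to handle that tail looks like this (my illustration of the pattern, not the compiler's exact code): SIMD for the full 4-element blocks, then scalar abs() for the remainder:

    #include <stdlib.h>
    #include <tmmintrin.h>

    void absolute_mixed(int *array, int n)
    {
            int i = 0;
            for(; i + 4 <= n; i += 4)                   // full 128-bit blocks
            {
                    __m128i v = _mm_loadu_si128((__m128i*)&array[i]);
                    _mm_storeu_si128((__m128i*)&array[i], _mm_abs_epi32(v));
            }
            for(; i < n; i++)                           // 0 to 3 leftover elements
                    array[i] = abs(array[i]);
    }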

    What's amazing here is that simply by using abs() we can achieve almost exactly the same speed as hand-written SIMD, with the simplicity of calling a C library function. The abs() loop without -march=native is only 4 instructions longer; instead of utilizing PABSD, it uses the PSRAD, PXOR, and PSUBD instructions.

    Why is portable abs() faster than the XOR implementation?

    It turns out the portable (or non-native) abs() assembly is nearly identical to that of the XOR implementation.

    Here's the abs():

    psrad   $31, %xmm0
    pxor    %xmm0, %xmm1
    psubd   %xmm0, %xmm1
    

    Here's the XOR:

    psrad   $31, %xmm1
    movdqa  %xmm1, %xmm2
    pxor    %xmm1, %xmm0
    pand    %xmm3, %xmm2
    paddd   %xmm2, %xmm0
    

    Now let's convert these back into C code:

    Here's the abs():

    int ntemp = array[i] >> 31;
    array[i] ^= ntemp;
    array[i] -= ntemp;
    

    Here's the XOR:

    uint32_t ntemp = array[i] >> 31;
    array[i] ^= ntemp;
    array[i] += ntemp & 1;
    

    The difference comes down to the extra bitwise AND (and the register copy it requires) in the original XOR implementation: abs() subtracts the whole mask, while the XOR version masks it down to 1 and adds. Both are correct; for example, with x = -5 the mask is all ones (-1), x ^ mask gives 4, and either 4 - (-1) or 4 + 1 yields 5.

    Final Conclusion

    Use abs()! :D
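
    In other words, the whole exercise boils down to the simple version from the _OPT4 benchmark above:

    #include <stdlib.h>

    void absolute(int array[], int n)
    {
            for(int i = 0; i < n; i++)
                    array[i] = abs(array[i]);   // the compiler emits PABSD (or the shift/xor/sub sequence) for you
    }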