Tags: c++, swap, intrinsics, c++03, sunstudio

How to swap two __m128i variables in C++03, given it's an opaque type and an array?


What is the best practice for swapping __m128i variables?

The background is a compile error under Sun Studio 12.2, which is a C++03 compiler. __m128i is an opaque type used with MMX and SSE instructions, and it's usually an unsigned long long[2]. C++03 does not provide support for swapping arrays, and std::swap(__m128i a, __m128i b) fails under the compiler.


Here are some related questions that don't quite hit the mark. They don't apply because std::vector is not available.


Solution

  • This doesn't sound like a best-practices issue; it sounds like you need a workaround for a seriously broken implementation of intrinsics. If __m128i tmp = a; doesn't compile, that's pretty bad.


    If you're going to write a custom swap function, keep it simple. __m128i is a POD type that fits in a single vector register. Don't do anything that will encourage the compiler to spill it to memory. Some compilers will generate really horrible code even for a trivial test-case, and even gcc/clang might trip over a memcpy as part of optimizing a big complicated function.

    Since the compiler is choking on the constructor, just declare a tmp variable (without an initializer, if even copy-initialization fails) and use = assignment to do the copying. That works efficiently in any compiler that supports __m128i, and is a common pattern.

    Plain assignment to/from values in memory works like _mm_store_si128 / _mm_load_si128: i.e. movdqa aligned stores/loads that will fault if used on unaligned addresses. (Of course, optimization can result in loads getting folded into memory operands to another vector instruction, or stores not happening at all.)
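Because plain assignment through a pointer behaves like the aligned load/store intrinsics, data that might not be 16-byte aligned needs the unaligned variants instead. The following is a minimal sketch of that case, assuming SSE2 is available through `<emmintrin.h>`; the helper name `swap128_unaligned` is hypothetical and not part of the original answer.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2: __m128i, _mm_loadu_si128, _mm_storeu_si128

// Hypothetical helper (an assumption, not from the answer): swap two
// 16-byte blocks that are NOT guaranteed to be 16-byte aligned.
inline void swap128_unaligned(void* p, void* q) {
    assert(p != 0 && q != 0);  // C++03 has no nullptr

    // Declare first, assign afterwards, to stay clear of the
    // initializer forms that the problem compiler rejects.
    __m128i a;
    __m128i b;
    a = _mm_loadu_si128(static_cast<const __m128i*>(p));  // movdqu load
    b = _mm_loadu_si128(static_cast<const __m128i*>(q));

    // Both loads happen before either store, so overlap is harmless.
    _mm_storeu_si128(static_cast<__m128i*>(p), b);        // movdqu store
    _mm_storeu_si128(static_cast<__m128i*>(q), a);
}
```

For pointers that are known to be aligned, dereferencing a `__m128i*` and assigning is enough, and it gives the compiler the most freedom to keep values in registers.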

    // alternate names: assignment_swap
    // or swap128, but then the name doesn't fit for __m256i...
    
    // __m128i t(a) errors, so just use simple initializers / assignment
    template<class T>
    void vecswap(T& a, T& b) {
        // T t = a;     // Apparently SunCC even choked on this
        T t;
        t = a;
        a = b;
        b = t;
    }
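To check that the assignment-based swap really exchanges all 128 bits, something like the following could be used. It is only a sketch: `equal128` and `check_vecswap` are hypothetical helper names introduced for illustration, and SSE2 (`<emmintrin.h>`) is assumed.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics

// Same assignment-based swap as in the answer.
template<class T>
void vecswap(T& a, T& b) {
    T t;
    t = a;
    a = b;
    b = t;
}

// Hypothetical helper: true if all 128 bits of x and y match.
inline bool equal128(__m128i x, __m128i y) {
    // pcmpeqb sets a byte to 0xFF where the inputs are equal;
    // pmovmskb gathers the top bit of each of the 16 bytes.
    return _mm_movemask_epi8(_mm_cmpeq_epi8(x, y)) == 0xFFFF;
}

// Hypothetical self-check: swap two known vectors and verify both sides.
inline void check_vecswap() {
    __m128i a;
    __m128i b;
    a = _mm_set1_epi32(1);
    b = _mm_set1_epi32(2);
    vecswap(a, b);
    assert(equal128(a, _mm_set1_epi32(2)));
    assert(equal128(b, _mm_set1_epi32(1)));
}
```

Calling `check_vecswap()` once from a test driver is enough; the template works unchanged for other vector types such as `__m256i`, provided the compiler supports them.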
    

    Test cases: this produces optimal code even with a crusty compiler like ICC13, which does a terrible job with the memcpy version. The asm output below is from the Godbolt compiler explorer, with icc13 -O3:

    __m128i test_return2nd(__m128i x, __m128i y) {
        vecswap(x, y);
        return x;
    }
    
        movdqa    xmm0, xmm1
        ret                    # returning the 2nd arg, which was in xmm1
    
    
    __m128i test_return1st(__m128i x, __m128i y) {
        vecswap(x, y);
        return y;
    }
    
        ret                   # returning the first arg, already in xmm0
    

    With memswap (a swap implemented with memcpy), you get something like

    return1st_memcpy(__m128i, __m128i):        ## ICC13 -O3
        movdqa    XMMWORD PTR [-56+rsp], xmm0
        movdqa    XMMWORD PTR [-40+rsp], xmm1    # spill both
        movaps    xmm2, XMMWORD PTR [-56+rsp]    # reload x
        movaps    XMMWORD PTR [-24+rsp], xmm2    # copy x to tmp
        movaps    xmm0, XMMWORD PTR [-40+rsp]    # reload y
        movaps    XMMWORD PTR [-56+rsp], xmm0    # copy y to x
        movaps    xmm0, XMMWORD PTR [-24+rsp]    # reload tmp
        movaps    XMMWORD PTR [-40+rsp], xmm0    # copy tmp to y
        movdqa    xmm0, XMMWORD PTR [-40+rsp]    # reload y
        ret                                      # return y
    

    This is pretty much the absolute maximum amount of spilling/reloading you could imagine to swap two registers, because icc13 doesn't optimize between the inlined memcpys at all, or even remember what is left in a register.
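The memcpy-based swap itself is not shown here. For reference, it presumably looks roughly like the sketch below; this is an assumed reconstruction for illustration, not the exact code that was compiled for the listing.

```cpp
#include <cassert>
#include <cstring>      // std::memcpy
#include <emmintrin.h>  // SSE2: __m128i

// Assumed shape of the memcpy-based swap ("memswap") being compared
// against vecswap. Valid only for POD types such as __m128i.
template<class T>
void memswap(T& a, T& b) {
    assert(&a != &b);  // memcpy requires non-overlapping source and destination

    T t;
    std::memcpy(&t, &a, sizeof(T));  // tmp = a
    std::memcpy(&a, &b, sizeof(T));  // a = b
    std::memcpy(&b, &t, sizeof(T));  // b = tmp
}
```

Taking the address of each operand is what tends to push the compiler toward memory: a weak optimizer keeps `a`, `b`, and `t` on the stack instead of treating the three copies as register moves.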


    Swapping values already in memory

    Even gcc makes worse code with the memcpy version. It does the copy with 64-bit integer loads/stores instead of a 128-bit vector load/store. This is terrible if you're about to load the vector (store-forwarding stall), and otherwise is just bad (more uops to do the same work).

    // the memcpy version of this compiles badly
    void test_mem(__m128i *x, __m128i *y) {
        vecswap(*x, *y);
    }
        # gcc 5.3 and ICC13 make the same code here, since it's easy to optimize
        movdqa  xmm0, XMMWORD PTR [rdi]
        movdqa  xmm1, XMMWORD PTR [rsi]
        movaps  XMMWORD PTR [rdi], xmm1
        movaps  XMMWORD PTR [rsi], xmm0
        ret
    
    // gcc 5.3 with memswap instead of vecswap.  ICC13 is similar
    test_mem_memcpy(long long __vector(2)*, long long __vector(2)*):
        mov     rax, QWORD PTR [rdi]
        mov     rdx, QWORD PTR [rdi+8]
        mov     r9, QWORD PTR [rsi]
        mov     r10, QWORD PTR [rsi+8]
        mov     QWORD PTR [rdi], r9
        mov     QWORD PTR [rdi+8], r10
        mov     QWORD PTR [rsi], rax
        mov     QWORD PTR [rsi+8], rdx
        ret