What is the best practice for swapping __m128i
variables?
The background is a compile error under Sun Studio 12.2, which is a C++03 compiler. __m128i
is an opaque type used with MMX and SSE instructions, and its usually and unsigned long long[2]
. C++03 does not provide the support for swapping arrays, and std:swap(__m128i a, __m128i b)
fails under the compiler.
Here are some related questions that don't quite hit the mark. They don't apply because std::vector
is not available.
This doesn't sound like a best-practices issue; it sounds like you need a workaround for a seriously broken implementation of intrinsics. If __m128i tmp = a;
doesn't compile, that's pretty bad.
If you're going to write a custom swap function, keep it simple. __m128i
is a POD type that fits in a single vector register. Don't do anything that will encourage the compiler to spill it to memory. Some compilers will generate really horrible code even for a trivial test-case, and even gcc/clang might trip over a memcpy as part of optimizing a big complicated function.
Since the compiler is choking on the constructor, just declare a tmp variable with a normal initializer, and use =
assignment to do the copying. That always works efficiently in any compiler that supports __m128i
, and is a common pattern.
Plain assignment to/from values in memory works like _mm_store_si128
/ _mm_load_si128
: i.e. movdqa
aligned stores/loads that will fault if used on unaligned addresses. (Of course, optimization can result in loads getting folded into memory operands to another vector instruction, or stores not happening at all.)
// alternate names: assignment_swap
// or swap128, but then the name doesn't fit for __m256i...
// __m128i t(a) errors, so just use simple initializers / assignment
template<class T>
void vecswap(T& a, T& b) {
// T t = a; // Apparently SunCC even choked on this
T t;
t = a;
a = b;
b = t;
}
Test cases: optimal code even with a crusty compiler like ICC13 which does a terrible job with the memcpy version. asm output from the Godbolt compiler explorer, with icc13 -O3
__m128i test_return2nd(__m128i x, __m128i y) {
vecswap(x, y);
return x;
}
movdqa xmm0, xmm1
ret # returning the 2nd arg, which was in xmm1
__m128i test_return1st(__m128i x, __m128i y) {
vecswap(x, y);
return y;
}
ret # returning the first arg, already in xmm0
With memswap, you get something like
return1st_memcpy(__m128i, __m128i): ## ICC13 -O3
movdqa XMMWORD PTR [-56+rsp], xmm0
movdqa XMMWORD PTR [-40+rsp], xmm1 # spill both
movaps xmm2, XMMWORD PTR [-56+rsp] # reload x
movaps XMMWORD PTR [-24+rsp], xmm2 # copy x to tmp
movaps xmm0, XMMWORD PTR [-40+rsp] # reload y
movaps XMMWORD PTR [-56+rsp], xmm0 # copy y to x
movaps xmm0, XMMWORD PTR [-24+rsp] # reload tmp
movaps XMMWORD PTR [-40+rsp], xmm0 # copy tmp to y
movdqa xmm0, XMMWORD PTR [-40+rsp] # reload y
ret # return y
This is pretty much the absolute maximum amount of spilling/reloading you could imagine to swap two registers, because icc13 doesn't optimize between the inlined memcpy
s at all, or even remember what is left in a register.
Even gcc makes worse code with the memcpy version. It does the copy with 64bit integer loads/stores instead of a 128bit vector load/store. This is terrible if you're about to load the vector (store-forwarding stall), and otherwise is just bad (more uops to do the same work).
// the memcpy version of this compiles badly
void test_mem(__m128i *x, __m128i *y) {
vecswap(*x, *y);
}
# gcc 5.3 and ICC13 make the same code here, since it's easy to optimize
movdqa xmm0, XMMWORD PTR [rdi]
movdqa xmm1, XMMWORD PTR [rsi]
movaps XMMWORD PTR [rdi], xmm1
movaps XMMWORD PTR [rsi], xmm0
ret
// gcc 5.3 with memswap instead of vecswap. ICC13 is similar
test_mem_memcpy(long long __vector(2)*, long long __vector(2)*):
mov rax, QWORD PTR [rdi]
mov rdx, QWORD PTR [rdi+8]
mov r9, QWORD PTR [rsi]
mov r10, QWORD PTR [rsi+8]
mov QWORD PTR [rdi], r9
mov QWORD PTR [rdi+8], r10
mov QWORD PTR [rsi], rax
mov QWORD PTR [rsi+8], rdx
ret