c++11 gcc initializer-list bad-alloc mmx

bad_alloc with unordered_map initializer_list and MMX instruction, possible heap corruption?

I am getting a bad_alloc thrown from the code below when compiled with gcc (tried 4.9.3, 5.4.0 and 6.2). gdb tells me it happens on the last line, where the unordered_map is constructed from an initializer_list. If I comment out the MMX instruction _m_maskmovq, there is no error. Similarly, if I comment out the initialization of the unordered_map, there is no error. Only when I both invoke the MMX instruction and initialize the unordered_map with an initializer_list do I get the bad_alloc. If I default-construct the unordered_map and call map.emplace(1,1), there is also no error. I've run this on a CentOS 7 machine with 48 cores (Intel Xeon) and 376 GB RAM, and also on a Dell laptop (Intel Core i7) under Ubuntu on WSL, with the same result. What is going on here? Is the MMX instruction corrupting the heap? Valgrind didn't identify anything useful.

Compiler command and output:

$g++ -g -std=c++11 main.cpp
$./a.out
   terminate called after throwing an instance of 'std::bad_alloc'
   what():  std::bad_alloc
   Aborted

Source code (main.cpp):

#include <immintrin.h>
#include <unordered_map>

int main()
{
  __m64 a_64 = _mm_set_pi8(0,0,0,0,0,0,0,0);
  __m64 b_64 = _mm_set_pi8(0,0,0,0,0,0,0,0);
  char dest[8] = {0};
  _m_maskmovq(a_64, b_64, dest);

  std::unordered_map<int, int> map{{ 1, 1}};
}

Update: The _mm_empty() workaround does fix this example. It doesn't seem like a viable solution for multithreaded code where one thread is executing vector instructions and another is using an unordered_map. Another interesting point: if I turn on optimization with -O3, the bad_alloc goes away. Fingers crossed we never hit this error in production (cringe).

Solution

There is no heap corruption. This happens because std::unordered_map uses long double internally to compute the bucket count from the number of elements in the initializer (see _Prime_rehash_policy::_M_bkt_for_elements in the libstdc++ sources). With GCC on x86, long double arithmetic is carried out on the x87 FPU, even on x86-64 where float and double normally use SSE.
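To make the long double dependency concrete, here is a simplified sketch of that kind of bucket-count computation. It is not the libstdc++ code: the function name buckets_for_elements and its signature are made up for illustration, and the real policy additionally rounds the result up to a prime. The point is only that a division and a ceil in long double end up on the x87 FPU when built with GCC for x86.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for the bucket-count computation; the name and
// signature are illustrative only and do not match libstdc++ internals.
inline std::size_t buckets_for_elements(std::size_t n, float max_load_factor) {
  assert(max_load_factor > 0.0f);
  // The division and ceil are done in long double. With GCC on x86 this is
  // x87 arithmetic, which is what makes it sensitive to leftover MMX state.
  long double buckets = std::ceil(n / static_cast<long double>(max_load_factor));
  return static_cast<std::size_t>(buckets);
}
```

If the x87 state is not clean at this point, the intermediate long double value can be garbage, and so can the bucket count derived from it.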

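It is necessary to call _mm_empty before switching from MMX code to FPU code. This goes back to a historic decision to alias the MMX register file onto the x87 FPU registers (sort of the opposite of register renaming in modern CPUs). After an MMX instruction, the x87 registers stay marked as in use until the EMMS instruction (which _mm_empty emits) is executed, so the next x87 computation produces an invalid result. Here that presumably yields a nonsensical bucket count, and the resulting allocation request fails with bad_alloc.

Regarding multithreaded code: the x87/MMX state is part of each thread's register context, which the operating system saves and restores on context switches. MMX code in one thread therefore cannot disturb an unordered_map used in another thread. _mm_empty is only needed in the thread that executed MMX instructions, before that same thread runs FPU code.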

The exception goes away if the _mm_empty call is added:

…
  _m_maskmovq(a_64, b_64, dest);
  _mm_empty();
  std::unordered_map<int, int> map{{ 1, 1}};
…
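To avoid forgetting the call on some exit path, the _mm_empty call can be tied to a scope. The following is a minimal sketch under that assumption. MmxScope, masked_store and map_size_after_mmx are hypothetical names introduced here for illustration and are not part of any library. Only _mm_set_pi8, _m_maskmovq and _mm_empty come from the original code.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical RAII helper: calls _mm_empty() when the scope that used
// MMX intrinsics ends, including on early return or exception.
struct MmxScope {
  MmxScope() = default;
  MmxScope(const MmxScope&) = delete;
  MmxScope& operator=(const MmxScope&) = delete;
  ~MmxScope() { _mm_empty(); }
};

// Hypothetical wrapper around the MMX part of the question's code.
// dest must point to at least 8 writable bytes.
inline void masked_store(char* dest) {
  assert(dest != nullptr);
  MmxScope guard;  // _mm_empty() runs when this function returns
  __m64 a_64 = _mm_set_pi8(0, 0, 0, 0, 0, 0, 0, 0);
  __m64 b_64 = _mm_set_pi8(0, 0, 0, 0, 0, 0, 0, 0);
  _m_maskmovq(a_64, b_64, dest);
}

// Same sequence as in the question: MMX code first, then the map.
inline std::size_t map_size_after_mmx() {
  char dest[8] = {0};
  masked_store(dest);  // the x87 state is clean again after this call
  std::unordered_map<int, int> map{{1, 1}};
  return map.size();
}
```

Keeping all MMX intrinsics inside such wrapper functions means the rest of the program never runs with leftover MMX state.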

See GCC PR 88998, as identified by cpplearner.

There is ongoing work to implement the MMX intrinsics with SSE on x86-64, which will make this issue disappear because SSE instructions do not affect the FPU state and vice versa.