c sse simd sse2

_mm_load_si128 loads data in reverse order


I am writing a C function with SSE2 intrinsics that compares four 32-bit integers against zero, checks which of them are greater, and returns the result as a 16-bit mask. I am using the following code to do this:

#include <x86intrin.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>


static void cmp_example(void) {
    /* _mm_load_si128 requires a 16-byte aligned address */
    _Alignas(16) const uint32_t byte_vals[] = {0, 5, 0, 3};
    __m128i got_data = _mm_load_si128((__m128i const*)byte_vals);
    __m128i cmp_data = _mm_setzero_si128();
    __m128i result = _mm_cmpgt_epi32 (got_data, cmp_data);
    int mask_result = _mm_movemask_epi8(result);
    printf("Result 0x%x\n", mask_result & 0xFFFF);
}

However, when I compile and run this, it prints 0xf0f0. I would expect the result to follow the same order in which the data was loaded from memory. To check further, I added some debugging statements:

/* _mm_load_si128 requires a 16-byte aligned address */
_Alignas(16) const uint32_t byte_vals[] = {0, 5, 0, 3};
__m128i got_data = _mm_load_si128((__m128i const*)byte_vals);
printf("0x%llx 0x%llx\n", got_data[0], got_data[1]);
__m128i cmp_data = _mm_setzero_si128();
__m128i result = _mm_cmpgt_epi32 (got_data, cmp_data);
printf("0x%llx 0x%llx\n", result[0], result[1]);
int mask_result = _mm_movemask_epi8(result);
printf("Result 0x%x\n", mask_result & 0xFFFF);

This run prints:

0x500000000 0x300000000
0xffffffff00000000 0xffffffff00000000
Result 0xf0f0

Thus, it seems the culprit here is _mm_load_si128.

Based on this, how can I get _mm_load_si128 to load data in the same order as it is laid out in memory?


Solution

  • _mm_load_si128 does load the data in memory order. x86 is little endian, so the 32-bit word at the lowest address (word 0) goes, at least conceptually, to element 0 of the xmm register.

    But hexadecimal values are printed most significant digit first, i.e. in big-endian order. The first int64_t element of the xmm register, got_data[0], holds the byte stream 00 00 00 00 05 00 00 00, which reads as 0x(000000)0500000000ull.

    Depending on the context, the values must be read left to right or right to left. The 0th nibble of the mask (0x000F) corresponds to the 0th word of the result, so 0xf0f0 means that words 1 and 3 (the values 5 and 3) are greater than zero.
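To make the element order visible without relying on how a 64-bit value is printed, the lanes can be stored back to a uint32_t array, and the mask can be reduced to one bit per element. The following is only a sketch: the helper names lanes_to_array, positive_mask16 and positive_mask4 are made up for illustration, and it uses _mm_loadu_si128 so that the input does not need to be 16-byte aligned.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE header) */
#include <stdint.h>

/* Hypothetical helper: copy the four 32-bit lanes of v to out[0..3].
   out[i] is lane i, i.e. the same order as the original array in memory. */
static void lanes_to_array(__m128i v, uint32_t out[4]) {
    _mm_storeu_si128((__m128i *)out, v);
}

/* Hypothetical helper: 16-bit byte mask as in the question.
   Bits 4*i .. 4*i+3 are set when vals[i] > 0.
   Note: _mm_cmpgt_epi32 is a signed compare, so values >= 0x80000000
   are treated as negative. */
static int positive_mask16(const uint32_t vals[4]) {
    __m128i v  = _mm_loadu_si128((const __m128i *)vals);
    __m128i gt = _mm_cmpgt_epi32(v, _mm_setzero_si128());
    return _mm_movemask_epi8(gt);
}

/* Hypothetical helper: 4-bit mask, bit i is set when vals[i] > 0.
   _mm_movemask_ps collects the sign bit of each 32-bit lane. */
static int positive_mask4(const uint32_t vals[4]) {
    __m128i v  = _mm_loadu_si128((const __m128i *)vals);
    __m128i gt = _mm_cmpgt_epi32(v, _mm_setzero_si128());
    return _mm_movemask_ps(_mm_castsi128_ps(gt));
}
```

For the input {0, 5, 0, 3}, positive_mask16 returns 0xf0f0 and positive_mask4 returns 0xa (bits 1 and 3 set). In both cases the least significant bits belong to element 0, which is why the mask looks reversed when it is read left to right as a hexadecimal number.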