c++ floating-point sse simd avx

How to efficiently perform double/int64 conversions with SSE/AVX?


SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers.

But there are no equivalents for double-precision floats and 64-bit integers. In other words, the following are missing: _mm_cvtpd_epi64 (double -> int64) and _mm_cvtepi64_pd (int64 -> double).

It seems that AVX doesn't have them either.

What is the most efficient way to simulate these intrinsics?


Solution

  • There's no single instruction until AVX512, which added conversion to/from 64-bit integers, signed or unsigned, along with conversions to/from 32-bit unsigned. See intrinsics like _mm512_cvtpd_epi64 and the narrower AVX512VL versions, like _mm256_cvtpd_epi64.

    If you only have AVX2 or less, you'll need tricks like the ones below for packed conversion. (For scalar, x86-64 has int64_t <-> double or float since SSE2, but scalar uint64_t <-> FP requires tricks until AVX512 adds unsigned conversions. Scalar 32-bit unsigned can be done by zero-extending to 64-bit signed.)
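    For instance, the 32-bit unsigned scalar case mentioned above is just a cast chain (a minimal sketch; the helper name here is made up):

```cpp
#include <cstdint>

//  Zero-extend the unsigned 32-bit value to a signed 64-bit one, then let the
//  hardware's signed int64 -> double conversion (cvtsi2sd on x86-64) do the work.
double u32_to_double(uint32_t x){
    return (double)(int64_t)x;
}
```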


    If you're willing to cut corners, double <-> int64 conversions can be done in only two instructions:

    double -> uint64_t

    //  Only works for inputs in the range: [0, 2^52)
    __m128i double_to_uint64(__m128d x){
        x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));     //  M = 2^52 (hex literal converts to double)
        return _mm_xor_si128(                                   //  XOR strips M's sign/exponent bits,
            _mm_castpd_si128(x),                                //  leaving the integer in the low 52 bits
            _mm_castpd_si128(_mm_set1_pd(0x0010000000000000))
        );
    }
    

    double -> int64_t

    //  Only works for inputs in the range: [-2^51, 2^51]
    __m128i double_to_int64(__m128d x){
        x = _mm_add_pd(x, _mm_set1_pd(0x0018000000000000));     //  M = 2^52 + 2^51
        return _mm_sub_epi64(                                   //  integer subtract of M undoes the bias
            _mm_castpd_si128(x),                                //  and restores the sign
            _mm_castpd_si128(_mm_set1_pd(0x0018000000000000))
        );
    }
    

    uint64_t -> double

    //  Only works for inputs in the range: [0, 2^52)
    __m128d uint64_to_double(__m128i x){
        x = _mm_or_si128(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)));     //  OR in M's exponent bits
        return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0010000000000000));    //  subtract M = 2^52 to un-normalize
    }
    

    int64_t -> double

    //  Only works for inputs in the range: [-2^51, 2^51]
    __m128d int64_to_double(__m128i x){
        x = _mm_add_epi64(x, _mm_castpd_si128(_mm_set1_pd(0x0018000000000000)));    //  integer add of M = 2^52 + 2^51 biases into range
        return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(0x0018000000000000));    //  FP subtract of M recovers the value
    }
    

    Rounding Behavior:

      • double -> uint64_t/int64_t: correctly rounded according to the current rounding mode (usually round-half-to-even), since adding M forces the rounding to happen at the integer boundary.

      • uint64_t/int64_t -> double: exact. Every integer in the supported ranges fits in a double's 52-bit mantissa, so nothing is rounded.

    How does it work?

    Despite this trick being only 2 instructions, it's not entirely self-explanatory.

    The key is to recognize that for double-precision floating-point, values in the range [2^52, 2^53) have the "binary place" just below the lowest bit of the mantissa. In other words, if you zero out the exponent and sign bits, the mantissa becomes precisely the integer representation.
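    This can be checked directly in scalar code (a minimal sketch; the helper name here is made up):

```cpp
#include <cstdint>
#include <cstring>

//  For n in [0, 2^52), the double (2^52 + n) stores n verbatim in its low
//  52 mantissa bits; only the sign/exponent bits (0x433) sit above them.
uint64_t mantissa_bits(double d){
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof(bits));   //  reinterpret the bits, don't convert
    return bits & 0x000FFFFFFFFFFFFFull;    //  keep the 52 mantissa bits
}
```

    For example, mantissa_bits(4503599627370496.0 + 123456.0) yields 123456.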

    To convert x from double -> uint64_t, you add the magic number M which is the floating-point value of 2^52. This puts x into the "normalized" range of [2^52, 2^53) and conveniently rounds away the fractional part bits.

    Now all that's left is to remove the upper 12 bits. This is easily done by masking them out. The fastest way is to recognize that those upper 12 bits are identical to those of M. So rather than introducing an additional mask constant, we can simply subtract or XOR by M. XOR has higher throughput.

    Converting from uint64_t -> double is simply the reverse of this process. You add back the exponent bits of M. Then un-normalize the number by subtracting M in floating-point.
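    Both unsigned directions can be sketched in scalar code (hypothetical helper names; the same [0, 2^52) restriction applies):

```cpp
#include <cstdint>
#include <cstring>

uint64_t scalar_double_to_uint64(double x){
    x += 4503599627370496.0;                //  add M = 2^52: normalizes and rounds off any fraction
    uint64_t bits;
    std::memcpy(&bits, &x, sizeof(bits));   //  reinterpret the bits
    return bits ^ 0x4330000000000000ull;    //  XOR with M's bit pattern clears the exponent field
}

double scalar_uint64_to_double(uint64_t x){
    x |= 0x4330000000000000ull;             //  OR in M's exponent bits
    double d;
    std::memcpy(&d, &x, sizeof(d));
    return d - 4503599627370496.0;          //  subtract M = 2^52 to un-normalize
}
```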

    The signed integer conversions are slightly trickier since you need to deal with the 2's complement sign-extension. I'll leave those as an exercise for the reader.

    Related: A fast method to round a double to a 32-bit int explained


    Full Range int64 -> double:

    After many years, I finally had a need for this.

    uint64_t -> double

    __m128d uint64_to_double_full(__m128i x){
        __m128i xH = _mm_srli_epi64(x, 32);
        xH = _mm_or_si128(xH, _mm_castpd_si128(_mm_set1_pd(19342813113834066795298816.)));          //  2^84
        __m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0xcc);   //  2^52
        __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(19342813118337666422669312.));     //  2^84 + 2^52
        return _mm_add_pd(f, _mm_castsi128_pd(xL));
    }
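    The split-in-halves idea behind this can be illustrated with a scalar sketch (an assumed helper, not part of the vector routine): convert each 32-bit half exactly, then recombine with a single rounded add.

```cpp
#include <cstdint>

//  Both halves convert exactly (each fits in 32 bits), and hi * 2^32 is also
//  exact, so the only rounding happens in the final add -- giving a correctly
//  rounded result over the full uint64_t range.
double scalar_uint64_to_double_full(uint64_t x){
    double hi = (double)(uint32_t)(x >> 32) * 4294967296.0;  //  high half * 2^32, exact
    double lo = (double)(uint32_t)x;                         //  low half, exact
    return hi + lo;                                          //  one rounding step
}
```

    The vector code above reaches the same result without any int -> FP conversion instructions, by building the bit patterns directly with blends.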
    

    int64_t -> double

    __m128d int64_to_double_full(__m128i x){
        __m128i xH = _mm_srai_epi32(x, 16);
        xH = _mm_blend_epi16(xH, _mm_setzero_si128(), 0x33);
        xH = _mm_add_epi64(xH, _mm_castpd_si128(_mm_set1_pd(442721857769029238784.)));              //  3*2^67
        __m128i xL = _mm_blend_epi16(x, _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)), 0x88);   //  2^52
        __m128d f = _mm_sub_pd(_mm_castsi128_pd(xH), _mm_set1_pd(442726361368656609280.));          //  3*2^67 + 2^52
        return _mm_add_pd(f, _mm_castsi128_pd(xL));
    }
    

    These work for the entire 64-bit range and are correctly rounded according to the current rounding mode.

    These are similar to wim's answer below - but with more abusive optimizations. As such, deciphering these will also be left as an exercise to the reader.