Tags: c, integer-division, single-precision

How to keep precision on int64_t = int64_t * float?


I would like to perform a correction on an int64_t by a factor in the range [0.01 .. 1.2], with a precision of about 0.01. The naive implementation would be:

int64_t apply_correction(int64_t y, float32_t factor)
{
    return y * factor;
}

Unfortunately, I will lose precision if I either cast factor to an int32_t or cast y to float.

However, if I can ensure y stays below 1<<56, I can use this trick:

(y * (int64_t)(factor * (1 << 8))) / (1 << 8)
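For example, with factor = 1.2 the scaled factor is (int64_t)(1.2 * (1 << 8)) = 307, so the trick effectively multiplies by 307.0 / 256 = 1.19921875, which is within my 0.01 tolerance of 1.2.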

How can I solve this problem if my input value can be bigger than 1<<56?

Plot twist:

I am running on a 32-bit architecture where int64_t is an emulated type and where I don't have any support for double precision. The architecture is SHARC from Analog Devices.


Solution

  • If you calculate ((int64_t)1 << 57) * 100 or * 256, you will have a signed integer overflow, which makes your code's behaviour undefined. If you used uint64_t instead, the wrap-around would be well-defined, but the result would still be wrong.
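    To put a number on that limit: with factor <= 1.2, the scaled factor (int64_t)(factor * (1 << 8)) is at most 307, which occupies 9 bits, so y * 307 fits in a signed 64-bit product only while y stays below INT64_MAX / 307, somewhere between 1 << 54 and 1 << 55, already less than the 1 << 56 hoped for in the question.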


    However, it is possible to make this work for numbers almost up to (1 << 63) / 1.2.

    If y were a uint64_t, you could split the original number into its most-significant 32 bits (shifted right by 32) and its least-significant 32 bits, and multiply each by (uint32_t)(factor * (1 << 8)).

    Then, after the multiplication, you do not shift the most-significant part right by 8; instead you shift it left by 24 (left by 32 for its position, right by 8 for the scale), and add the two parts together:

    uint64_t apply_uint64_correction(uint64_t y, float32_t factor)
    {
        /* factor converted to fixed point with 8 fractional bits */
        uint64_t most_significant = (y >> 32) * (uint32_t)(factor * (1 << 8));
        uint64_t least_significant = (y & 0xFFFFFFFFULL) * (uint32_t)(factor * (1 << 8));
        /* high part: << 32 for its position, >> 8 for the scale, i.e. << 24 */
        return (most_significant << 24) + (least_significant >> 8);
    }
    

    Now, apply_uint64_correction(1000000000000, 1.2) would result in 1199218750000, and apply_uint64_correction(1000000000000, 1.25) would result in 1250000000000.
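    A quick host-side check of these numbers (my own scaffold, not part of the original answer; it assumes the apply_uint64_correction definition above and a typedef standing in for the SHARC float32_t):

    #include <inttypes.h>
    #include <stdio.h>

    typedef float float32_t;   /* host stand-in for the SHARC type */

    uint64_t apply_uint64_correction(uint64_t y, float32_t factor);   /* as defined above */

    int main(void)
    {
        /* effective factor is 307 / 256 = 1.19921875, hence 1199218750000 */
        printf("%" PRIu64 "\n", apply_uint64_correction(1000000000000ULL, 1.2f));
        /* 1.25 * 256 = 320 exactly, hence 1250000000000 */
        printf("%" PRIu64 "\n", apply_uint64_correction(1000000000000ULL, 1.25f));
        return 0;
    }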


    Actually, you can get more precision out of it if you can guarantee the range of factor:

    uint64_t apply_uint64_correction(uint64_t y, float32_t factor)
    {
        /* factor converted to fixed point with 24 fractional bits,
           capturing the full 24-bit float mantissa */
        uint64_t most_significant = (y >> 32) * (uint32_t)(factor * (1 << 24));
        uint64_t least_significant = (y & 0xFFFFFFFFULL) * (uint32_t)(factor * (1 << 24));
        /* high part: << 32 for its position, >> 24 for the scale, i.e. << 8 */
        return (most_significant << 8) + (least_significant >> 24);
    }
    

    apply_uint64_correction(1000000000000, 1.2) would give 1200000047683 on my computer; this is also the maximum precision you can get out of it if float32_t has a 24-bit mantissa: 1.2f is really 10066330 / (1 << 23) = 1.2000000476837..., and 1000000000000 * 1.2000000476837... = 1200000047683.7, so the residual error comes from the float itself, not from the fixed-point conversion.


    The above algorithm works for signed positive numbers as well, but since right-shifting negative numbers is implementation-defined territory, I'd take note of the sign, convert the value to uint64_t, do the calculations portably, and then negate the result if the original sign was negative.

    int64_t apply_correction(int64_t y, float32_t factor) {
        int negative_result = 0;
        uint64_t positive_y = y;
        if (y < 0) {
            negative_result = 1;
            /* negate in unsigned arithmetic; well-defined even for INT64_MIN */
            positive_y = -(uint64_t)y;
        }

        uint64_t result = apply_uint64_correction(positive_y, factor);
        return negative_result ? -(int64_t)result : (int64_t)result;
    }
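
    And a quick sanity check on the sign handling (my own test, assuming the definitions above; 1.25 * 256 = 320 is exact, so both results are exact):

    #include <assert.h>

    void test_sign_handling(void)
    {
        /* a negative input follows the y < 0 branch and mirrors the positive result */
        assert(apply_correction( 1000000000000LL, 1.25f) ==  1250000000000LL);
        assert(apply_correction(-1000000000000LL, 1.25f) == -1250000000000LL);
    }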