javascriptchashfloating-pointwebassembly

Double floating point operations using four IEEE rounding modes implemented in terms of round to nearest ties to even


I am looking to implement the RandomX Proof of Work hash function in JavaScript + WebAssembly. The README explains how this isn't possible:

Web mining is infeasible due to the large memory requirement and the lack of directed rounding support for floating point operations in both Javascript and WebAssembly.

Due to the pervasive use of repeatedly switching the rounding modes using a CFROUND instruction inside the bytecode and the fact that JavaScript has no ability to change the rounding mode or even use any other rounding mode other than ties to even, its impossible.

The CFROUND instruction manipulates the fprc abstract register inside the RandomX spec. fprc starts at zero, so round ties to even and is updated seemingly randomly on every instruction execution.

fprc rounding mode
0 roundTiesToEven
1 roundTowardNegative
2 roundTowardPositive
3 roundTowardZero

RandomX uses 5 properly rounded operations, add, sub, mul, div, and sqrt guaranteed to be properly rounded by the IEEE754 spec in double precision. My question is this. Soft float is not an option, I have access to floating point arithmetic only in round ties to even for those 5 operations. I need to find a way to emulate double precision arithmetic with all of the other rounding modes implemented in terms of one rounding mode for those 5 operations. This would be much faster than soft float.

Bruce at the Random ASCII blog had something to say:

If I understand correctly then the other three rounding modes will produce results that are either identical to round-to-nearest-even or one ULP away. If the round-to-nearest-even operation is exact then all rounding modes will give the same result. If it is not exact then you’d have to figure out which results would be different. I think that either round-towards-negative or round-towards-positive would be one ULP away, but not both.

FMA (fused multiply add) operations aren't available in JavaScript, however they are available in WebAssembly but gated behind non deterministic operations (which aren't widely supported). Those operations are f32x4.relaxed_*madd and f64x2.relaxed_*madd. If the use of FMA is required, this is fine, I can implement fallbacks.

The only paper that was even slightly close to floating point rounding mode emulation was Emulating round-to-nearest ties-to-zero ”augmented” floating-point operations using round-to-nearest ties-to-even arithmetic. It makes some interesting points but not useful.

Additional information 0: RandomX does operations using two classes of abstract floating point registers, group F (additive operations) and group E (multiplicative operations). According to the specification the absolute value of group F will never exceed 3.0e+14 and the approximate range of group E is 1.7E-77 to infinity (always positive). All results of floating point operations can NEVER be NaN. Answers may take this into account.

Additional information 1 (address @sech1 comment): The FSCAL_R instruction complicates things. Executing the instruction efficiently requires manipulating the actual floating point binary format:

The mathematical operation described above is equivalent to a bitwise XOR of the binary representation with the value of 0x80F0000000000000.

Which means alternative forms of representation (double-double, fixed point) are out of the picture:

Because the limited range of group F registers would allow the use of a more efficient fixed-point representation (with 80-bit numbers), the FSCAL instruction manipulates the binary representation of the floating point format to make this optimization more difficult.

There is a possibility that double-double arithmetic might work, if the FSCAL_R instruction could be splayed among both doubles.

Additional information 2 (address @aka.nice @sech1):

All floating point operations are rounded according to the current value of the fprc register (see Table 4.3.1). Due to restrictions on the values of the floating point registers, no operation results in NaN or a denormal number.

I have begin performing calculations constraining all output numbers within their ranges for the operations being performed on group E and F registers (no denormals, no NaNs, etc), with results in answers that are off by a single bit or resulting in infinity.

I have taken @aka.nice's suggestion on implementing operations using two-product. The testing harness and failure/success output are here. An excerpt is placed below:

FAIL: [mul(FE_UPWARD)] truth(...) == 0x1.00c915254a87ep+968 != op(...) == 0x1.00c915254a87dp+968
mul(FE_UPWARD): 19/20
FAIL: [mul(FE_DOWNWARD)] truth(...) == 0x1.fffffffffffffp+1023 != op(...) == inf
FAIL: [mul(FE_DOWNWARD)] truth(...) == 0x1.00c915254a87dp+968 != op(...) == 0x1.00c915254a87cp+968
FAIL: [mul(FE_DOWNWARD)] truth(...) == 0x1.fffffffffffffp+1023 != op(...) == inf
FAIL: [mul(FE_DOWNWARD)] truth(...) == 0x1.fffffffffffffp+1023 != op(...) == inf
FAIL: [mul(FE_DOWNWARD)] truth(...) == 0x1.fffffffffffffp+1023 != op(...) == inf
FAIL: [mul(FE_DOWNWARD)] truth(...) == 0x1.fffffffffffffp+1023 != op(...) == inf
mul(FE_DOWNWARD): 14/20
FAIL: [div(FE_TOWARDZERO)] truth(...) == 0x1.71bb8430246b4p-58 != op(...) == 0x1.71bb8430246b3p-58
FAIL: [div(FE_TOWARDZERO)] truth(...) == 0x1.c254556a24a56p+73 != op(...) == 0x1.c254556a24a55p+73
FAIL: [div(FE_TOWARDZERO)] truth(...) == 0x1.a67a7c9cb8d59p+483 != op(...) == 0x1.a67a7c9cb8d58p+483

Additional information 3 (address @aka.nice):

I've made great progress getting closer, currently have implemented addition and subtraction, 98% of multiplication and 40% of division and square root.

I've adjusted the round_toward function to have certain behavior around the "round down" around infinity as per the RandomX VM implemented in OpenCL.

double round_toward(double c, double res, int fe) {
    if (isinf(c)) {
        // fe & 1
        uint64_t inf_rnd = (2047UL << 52) - (fe == FE_DOWNWARD || fe == FE_TOWARDZERO);
        return u64_double(inf_rnd);
    }
    ...

The implementation of two-product without FMA lags behind by 1%:

void split64(double x, int delta, double *x_h, double *x_l) {
    // https://github.com/dsiem/ErrorFreeTransforms.jl/blob/master/src/mul.jl

    double f = (1UL << delta) + 1;
    double c = x * f;
    double h = c - (c - x);
    double l = x - h;

    *x_h = h;
    *x_l = l;
}

// 100% success
double mul_residue(double a, double b, double c) {
    return fma(a, b, -c);
}

// 99% success
double mul_residue(double a, double b, double c) {
    double a_h, a_l;
    double b_h, b_l;

    split64(a, 27, &a_h, &a_l);
    split64(b, 27, &b_h, &b_l);

    return a_l * b_l - (((c - a_h * b_h) - a_l * b_h) - a_h * b_l);
}

Code output is as such with the two-product EFT, not matching the use of the FMA:

FAIL: [mul(FE_TOWARDZERO)] truth(0x1.3459aa1f9d423p-244, 0x1.47ebd8e19bd97p+1017) == 0x1.8afa9bd8f2f64p+773 != op(...) == 0x1.8afa9bd8f2f65p+773
FAIL: [mul(FE_TOWARDZERO)] truth(0x1.36a0e20b199b7p+997, 0x1.441f01389c628p-96) == 0x1.89493d0cbd95dp+901 != op(...) == 0x1.89493d0cbd95ep+901
mul(FE_TOWARDZERO): 498/500
FAIL: [mul(FE_UPWARD)] truth(0x1.4c46f791b3cb9p+1001, 0x1.a55d5da424cd1p-165) == 0x1.1174f23ab0e37p+837 != op(...) == 0x1.1174f23ab0e36p+837
FAIL: [mul(FE_UPWARD)] truth(0x1.8c3c4851d9b75p-162, 0x1.ae85675639db9p+1006) == 0x1.4d2dde5e628f5p+845 != op(...) == 0x1.4d2dde5e628f4p+845
mul(FE_UPWARD): 498/500
FAIL: [mul(FE_DOWNWARD)] truth(0x1.3459aa1f9d423p-244, 0x1.47ebd8e19bd97p+1017) == 0x1.8afa9bd8f2f64p+773 != op(...) == 0x1.8afa9bd8f2f65p+773
FAIL: [mul(FE_DOWNWARD)] truth(0x1.36a0e20b199b7p+997, 0x1.441f01389c628p-96) == 0x1.89493d0cbd95dp+901 != op(...) == 0x1.89493d0cbd95ep+901
mul(FE_DOWNWARD): 498/500

The current implementation is here. I have been getting quite close, its just the edge cases and implementations with division and square root. It would be nice to have the mul_residue function the exact same with or without an fma() so that optimisations don't change the output.

Additional information 4 (address @aka.nice):

The current implementation passes all tests taking a sample of 500 thousand floats each for every operation and rounding mode, but only when mul_residue is implemented using fma(). The code for split() and mul_residue() is shown below:

void split64(double a, double *x_h, double *x_l) {
    double f = (1UL << 26) + 1;
    double c = f * a;
    double x = c - (c - a);
    double y = a - x;

    *x_h = x;
    *x_l = y;
}

double mul_residue(double a, double b, double c) {
    // 100% pass rate
    //return fma(a, b, -c);

    // 90% pass rate
    double a1, a2;
    double b1, b2;

    split64(a, &a1, &a2);
    split64(b, &b1, &b2);

    double r1 = a * b - (a1 * b1);
    double r2 = r1 - (a1 * b2);
    double r3 = r2 - (a2 * b1);
    double res1 = a2 * b2 - r3;
    double res2 = a * b - c;

    return res1 + res2;
}

Solution

Use the below version of emulated_fma instead of the above, then you'll end up with 100% pass rates.

static inline double upper_half(double x) {
    double secator = 134217729.0;
    double p = x * secator;
    return p + (x - p);
}

static inline double emulated_fma(double a, double b, double c) {
    double aup = upper_half(a);
    double alo = a - aup;
    double bup = upper_half(b);
    double blo = b - bup;

    double high = aup * bup; 
    double mid = aup * blo + alo * bup;
    double low = alo * blo;
    double ab = high + mid;
    double resab = (high - ab) + mid + low;

    double fma = ab + c;
    return resab + fma;
}

static inline double mul_residue(double a, double b, double c) {
    return emulated_fma(a, b, -c);
}

Average overhead was around x10 to x20, overheads with FMA available are around x4 to x12. Quite happy with the results.

fadd:
  fprc(0): 8clks
  fprc(1): 55clks [100000/100000] x6 overhead
  fprc(2): 29clks [100000/100000] x3 overhead
  fprc(3): 21clks [100000/100000] x2 overhead
fmul:
  fprc(0): 2clks
  fprc(1): 76clks [100000/100000] x28 overhead
  fprc(2): 38clks [100000/100000] x14 overhead
  fprc(3): 38clks [100000/100000] x14 overhead
fmul (fma):
  fprc(0): 15clks
  fprc(1): 31clks [100000/100000] x2 overhead
  fprc(2): 30clks [100000/100000] x2 overhead
  fprc(3): 46clks [100000/100000] x3 overhead

All code is present here: https://github.com/l1mey112/fp-rounding-emulation


Solution

  • Without FMA, you could use the two-product algorithm

    The two-product produces an exact multiplication, provided that the exponent does not overflow or underflow, in the form of a sum two floats :

    a x b = round(a x b) + residue
    

    If the residue is positive and we shall round upward, then the correct answer is the next float after round(a x b) toward positive infinity.

    For division, we can obtain a residue too by multiplication:

    round(a/b) x b = round( round(a/b) x b) + residue_1
    

    The residue what we seek is a bit different:

    (a/b) x b = round(a/b) x b + residue
    

    Thus, the residue of interest is:

    residue = a - round(a/b) x b
    residue = (a - round( round(a/b) x b)) - residue_1
    

    Similarly, for sqrt:

    round(sqrt(a)) x round(sqrt(a)) = round(round(sqrt(a)) x round(sqrt(a))) + residue_1
    
    a = round(sqrt(a)) x round(sqrt(a)) + residue
    
    residue = (a - round(round(sqrt(a)) x round(sqrt(a)))) - residue_1
    

    Since we are interested by the sign of residue rather than proper value, we can apply a 2^n scale on operands so as to avoid the overflow/underflow conditions

    Here is a sketch of solution using two-sum and two-product so as to find the residual rounding error.

    This solution requires using some implementation of standard nextafter() function, which can be found from various places like https://github.com/scijs/nextafter/blob/master/README.md

    Excuse my javascript skills which might be very approximative, but you'll get the idea.

    const TO_NEAREST = 0;
    const TO_NEGATIVE_INFINITY: = 1;
    const TO_POSITIVE_INFINITY: = 2;
    const TO_ZERO: = 3;
    const ROUNDING_DIRECTION=[Number.NaN , Number.Number.NEGATIVE_INFINITY , Number.POSITIVE_INFINITY , 0.0];
    
    function split(a)
    // helper function for two product, split a number intwo parts a=x+y
    // each part having no more than 26 consecutive bits set
    {
        let splitter=1+Math.pow(2,26); // splitter of two-product algorithm for double precision
        let c = factor*a; // BEWARE: might overflow for big a, but this case has been excluded a priori
        let x = c - (c-a);
        let y = a - x;
        return [x,y];
    }
    
    function mul_residue(a,b,c)
    // answer the residue of exact operation a*b-c
    // using the two product algorithm
    // could be expressed as fma(a,b,-c)
    // limitations: might fail if operations overflow/underflow
    {
        let [a1,a2] = split(a);
        let [b1,b2] = split(b);
        let r1 = a*b - (a1*b1);
        let r2 = r1 - (a1*b2);
        let r3 = r2 - (a2*b1);
        let res1 = a2*b2 - r3;
        let res2 = a*b - c;
        return res1 + res2;
    }
    
    function mul_residue_scaled(a,b,c)
    // answer the residue of exact operation a*b-c
    // after applying some scaling to avoid overflow/underflow
    {
        let [a_scaled , expa] = frexp(a);
        let [b_scaled , expb] = frexp(b);
        let c_scaled = ldexp(c , -expa-expb );
        return mul_residue(a_scaled,b_scaled,c_scaled)
    }
    
    function sum_residue(a,b,c)
    // answer the residue of operation c=a+b using the two sum algorithm
    {
        let a1=c-b;
        let b1=c-a;
        let delta_a=a-a1;
        let delta_b=b-b1;
        let res=delta_a+delta_b;
        return res;
    }
    
    function round_toward(c,res,mode)
    // round the result c to upper or lower, depending on the residue and rounding mode
    {
        let direction=ROUNDING_DIRECTION[ mode ];
        if(res > 0.0 && c < direction) return nextafter(c,direction);
        if(res < 0.0 && c > direction) return nextafter(c,direction);
        return c;
    }
    
    function round_sum(a,b,mode)
    {
        let c=a+b;
        // directed rounding might avoid case of overflow
        if ( (! isFinite(c)) && isFinite(a) && isFinite(b) && (b != 0)) return round_toward(c,-c,mode);
        res=sum_residue(a,b,c);
        return round_toward(c,res,mode);
    }
    
    function round_sub(a,b,mode)
    {
        return round_sum(a,-b,mode);
    }
    
    function round_mul(a,b,mode)
    {
        let c=a*b;
        if ( (! isFinite(c)) && isFinite(a) && isFinite(b) && (b != 0)) return round_toward(c,-c,mode);
        let res=mul_residue_scaled(a,b,c);
        return round_toward(c,res,mode);
    }
    
    function round_div(a,b,mode)
    {
        let c=a/b;
        if ( (! isFinite(c)) && isFinite(a) && isFinite(b) && (b != 0)) return round_toward(c,-c,mode);
        let res= mul_residue_scaled(c,b,a);
        return round_toward(c,-res * Math.sign(b),mode);
    }
    
    function round_sqrt(a,mode)
    {
        let c=Math.sqrt(a);
        let res=mul_residue_scaled(c,c,a);
        return round_toward(c,-res,mode);
    }
    

    Here is a candidate implementation for emulating frexp and ldexp.
    Note that multiplying by a power of two is an exact operation, except when the result underflows, this ldexp emulation takes care of such conditions.

    function exponent_bits(x)
    // extract the 11 bits of float representation encoding the exponent part
    {
        let float = new Float64Array(1),
        let bytes = new Uint8Array(float.buffer);
        float[0] = x;
        return ((bytes[7] & 0x7f) << 4 | bytes[6] >> 4);
    }
    
    function exponent(x)
    // answer the exponent of x
    // where non zero finite x can be represented as
    //    sign * significand * 2^exponent
    // with significand in the interval (0.5,1(
    // undefined behaviour for special values infinity and NaN
    {
        if(x === 0) return 0;
        let e = exponent_bits(x);
        if(e === 0) return exponent( x*Math.pow(2.0,53) ) - 53;
        return e - 1022;
    }
    
    function frexp(x)
    // emulate stdlib frexp function
    // answer the number scaled by a power of two so that it's magnitude fall in interval (0.5,1(
    // and the associated power
    {
        if( ! isFinite(x)) return [x,0];
        if( x == 0 ) return [x,0];
        let e = exponent(x);
        return [ldexp(x,-e) , e];
    }
    
    function ldexp(x,n)
    // emulate stdlib ldexp function
    // scale a number by a power of two : x * 2^n
    {
        if( ! isFinite(x)) return x;
        if( x == 0 ) return x;
        if( n > 1023 ) return ldexp(ldexp(x,1023),n-1023);
        if( n >= -1074) return x * Math.pow(2.0,n);
        // care to avoid inexact result in case of underflow
        let delta_to_underflow = Math.max( exponent(x) - 1022 , -1074 );
        // handle case of total cancellation
        if (delta_to_underflow >= 0) return ldexp(ldexp(x,-1022),n+1022);
        ldexp(ldexp(x,delta_to_underflow),n-delta_to_underflow);
    }