algorithmserializationfloating-pointdouble-double-arithmetic

How can I convert between a double-double and a decimal string?


One way of increasing precision beyond that of a double (e.g. if my application is doing something space-related that needs to represent accurate positions over distances of many light-years) is to use a double-double, a structure composed of two doubles which represents the sum of the two. Algorithms are known for the various arithmetic operations on such a structure, e.g. double-double + double-double, double × double-double, etc, e.g. as given in this paper.

(Note that this is not the same format as the IEEE 754-2008 binary128, a.k.a. quad-precision and conversion to/from double-double and binary128 is not guaranteed to round-trip.)

An obvious way to represent such a quantity as a string would then be to use strings representing each individual component of the double, e.g. "1.0+1.0e-200". My question is, is there a known way to convert to and from strings that represent the value as a single decimal? I.e. given the string "0.3" then provide the double-double closest to this representation, or go in the reverse direction. One naïve way would be to use successive multiplications/divisions by 10, but that is insufficient for doubles so I'm somewhat sceptical that they would work here.


Solution

  • such technique as summing 2 floating point variables just effectively doubles the mantissa bitwidth so its enough to just store/load bigger mantissa.

    Standard IEEE 754 double has 52+1 bit mantissa leading to

    log10(2^53) = 15.95 = ~16 [dec digits]
    

    so when you add 2 such variables then:

    log10(2^(53+53)) = 31.9 = ~32 [dec digits]
    

    so just store/load 32 digit mantissa to/from string. The exponent of the 2 variables will differ by +/- 53 so its enough to store just one of them.

    To further improve performance and precision you can use hex strings. Its much faster and there is no rounding as you can directly convert between the mantissa bits and hex string characters.

    any 4 bits form a single hexadecimal digit so

    (53+53) / 4 = 26.5 = ~27 [hex digits]
    

    As you can see its also more storage efficient the only problem is the exponent delimiter as hexa digits contain E so you need to distinct the digit and exponent separator by upper/lower casing or use different character or use just sign for example:

    1.23456789ABCDEFe10  
    1.23456789ABCDEFe+10
    1.23456789ABCDEF|+10
    1.23456789ABCDEF+10
    

    I usually use the first version. Also you need to take in mind the exponent is bit shift of mantissa so resulting number is:

    mantisa<<exponent = mantisa * (2^exponent)
    

    Now during loading/storing from/to string you just load 53+53 bit integer number then separate it to 2 mantissas and reconstruct the floating point values at bit level ... Its important that your mantissas are aligned so exp1+53 = exp2 give or take 1 ...

    All this can be done on integer arithmetics.

    If your exponent is exp10 then you will inflict heavy rounding on the number during both storage and loading to/from string as your mantissa will usually missing many zero bits before or after the decimal point making transformation between decadic and binary/hexadecimal very hard and inaccurate (especially if you limit your computation just to 64/80/128/160 bits of mantissa).

    Here an C++ example of just that (printing 32bit float in decadic on integer arithmetics only):

    //---------------------------------------------------------------------------
    AnsiString f32_prn(float fx)    // scientific format integers only
        {
        const int ms=10+5;  // mantisa digits
        const int es=2;     // exponent digits
        const int eb=100000;// 10^(es+3)
        const int sz=ms+es+5;
    
        char txt[sz],c;
        int i=0,i0,i1,m,n,exp,e2,e10;
        DWORD x,y,man;
        for (i0=0;i0<sz;i0++) txt[i0]=' ';
        // float -> DWORD
        x=((DWORD*)(&fx))[0];
        // sign
        if (x>=0x80000000){ txt[i]='-'; i++; x&=0x7FFFFFFF; }
         else             { txt[i]='+'; i++; }
        // exp
        exp=((x>>23)&255)-127;
        // man
        man=x&0x007FFFFF;
        if ((exp!=-127)&&(exp!=+128)) man|=0x00800000;  // not zero or denormalized or Inf/NaN
        // special cases
        if ((man==0)&&(exp==-127)){ txt[i]='0'; i++; txt[i]=0; return txt; }    // +/- zero
        if ((man==0)&&(exp==+128)){ txt[i]='I'; i++;
                                    txt[i]='N'; i++;
                                    txt[i]='F'; i++; txt[i]=0; return txt; }    // +/- Infinity
        if ((man!=0)&&(exp==+128)){ txt[i]='N'; i++;
                                    txt[i]='A'; i++;
                                    txt[i]='N'; i++; txt[i]=0; return txt; }    // +/- Not a number
        // align man,exp to 4bit
        e2=(1+(exp&3))&3;
        man<<=e2;
        exp-=e2+23; // exp of lsb of mantisa
        e10=0;      // decimal digits to add/remove
        m=0;        // mantisa digits
        n=ms;       // max mantisa digits
        // integer part
        if (exp>=-28)
            {
            x=man; y=0; e2=exp;
            // shift x to integer part <<
            if (x) for (;e2>0;)
                {
                while (x>0x0FFFFFFF){ y/=10; y+=((x%10)<<28)/10; x/=10; e10++; }
                e2-=4; x<<=4; y<<=4;
                x+=(y>>28)&15; y&=0x0FFFFFFF;
                }
            // shift x to integer part >>
            for (;e2<0;e2+=4) x>>=4;
            // no exponent?
            if ((e10>0)&&(e10<=es+3)) n++;  // no '.'
            // print
            for (i0=i;x;)
                {
                if (m<n){ txt[i]='0'+(x%10); i++; m++; if ((m==n)&&(x<eb)) m+=es+1; } else e10++;
                x/=10;
                }
            // reverse digits
            for (i1=i-1;i0<i1;i0++,i1--){ c=txt[i0]; txt[i0]=txt[i1]; txt[i1]=c; }
            }
        // fractional part
        if (exp<0)
            {
            x=man; y=0; e2=exp;
            // shift x to fractional part <<
            if (x) for (;e2<-28;)
                {
                while ((x<=0x19999999)&&(y<=0x19999999)){ y*=10; x*=10; x+=(y>>28)&15; y&=0x0FFFFFFF; e10--; }
                y>>=4; y&=0x00FFFFFF; y|=(x&15)<<24;
                x>>=4; x&=0x0FFFFFFF; e2+=4;
                }
            // shift x to fractional part <<
            for (;e2>-28;e2-=4) x<<=4;
            // print
            x&=0x0FFFFFFF;
            if ((m)&&(!e10)) n+=es+2;   // no exponent means more digits for mantisa
            if (x)
                {
                if (m){ txt[i]='.'; i++; }
                for (i0=i;x;)
                    {
                    y*=10; x*=10;
                    x+=(y>>28)&15;
                    if (m<n)
                        {
                        i0=((x>>28)&15);
                        if (!m)
                            {
                            if (i0)
                                {
                                txt[i]='0'+i0; i++; m++;
                                txt[i]='.';    i++;
                                }
                            e10--;
                            if (!e10) n+=es+2;  // no exponent means more digits for mantisa
                            }
                        else { txt[i]='0'+i0; i++; m++; }
                        } else break;
                    y&=0x0FFFFFFF;
                    x&=0x0FFFFFFF;
                    }
                }
            }
        else{
            // no fractional part
            if ((e10>0)&&(e10<sz-i))
             for (;e10;e10--){ txt[i]='0'+i0; i++; m++; }
            }
        // exponent
        if (e10)
            {
            if (e10>0)  // move . after first digit
                {
                for (i0=i;i0>2;i0--) txt[i0]=txt[i0-1];
                txt[2]='.'; i++; e10+=i-3;
                }
            // sign
            txt[i]='E'; i++;
            if (e10<0.0){ txt[i]='-'; i++; e10=-e10; }
             else       { txt[i]='+'; i++; }
            // print
            for (i0=i;e10;){ txt[i]='0'+(e10%10); e10/=10; i++; }
            // reverse digits
            for (i1=i-1;i0<i1;i0++,i1--){ c=txt[i0]; txt[i0]=txt[i1]; txt[i1]=c; }
            }
    
        txt[i]=0;
        return txt;
        }
    //---------------------------------------------------------------------------
    

    Just change the AnsiString return type into any string type or char* you got at your disposal ...

    As you can see its a lot of code with a lot of hacks and internally a lot more than 24bit of mantissa is used to lower the rounding errors inflicted by decadic exponent.

    So I strongly advice to use binary exponent (exp2) and hexa digits for mantissa it will simplify your problem a lot and get rid of the rounding entirely. The only problem is when you want print or input decadic number in such case you have no choice but to round ... Luckily you can use hexa output and convert it to decadic on strings... Or construct the print from single variable prints ...

    for more info see related QAs: