algorithmassemblybigintscientific-notationmasm64

Algorithm to convert binary big integer to scientific notation with truncation to a 1-digit mantissa


I have a very big unsigned binary big integer in the scale of 6*10^120. Assume the big integer is stored in a struct of many QWORD (8 bytes) unsigned integers or several YMM registers.

I want to display it in decimal (not binary) scientific notation like 6E120. The mantissa is always 1 digit and must be the leading digit of the full decimal representation; truncate it to 1 significant digit, not rounding to nearest. The exponent is always 3 digits. The format is aExyz like 8E095.

What is the most time-efficient (fastest) algorithm to find the order of magnitude (power of 10) and leading decimal digit? I am asking for the algorithm, not for the program, I will write it myself.

It will be done in MASM64 assembly language. If there are instructions that can help like bit manipulations or FPU/SSE/AVX512 tricks, please do suggest them.

This is not a high-level program so any responses that include third-party libraries or high-level language constructs are not helpful. I know a certain algorithm involves many divisions. Those are expensive in ASM so I'm looking for an alternative. I know how to convert from binary to decimal and then to scientific notation. I'm trying to avoid that middle step.


Solution

  • Assuming the maximum possible value is less than 1E154, and thus all values fit in 512 bits, than I suspect the answer might be:

    1. Precalculate all possible powers_of_10 in a static const array. (#ops ~0)
      • If all numbers fit in 2 YMM registers, that means the max supported value is 1E154, which means the lookup table of powers of 10 would take ~9856 bytes.
    2. In your number, calculate the number of leading binary zeroes (Using clz as well) (#ops < ~10)
    3. Use (max-bits - number_of_leading_zeroes) / 3.32192809489 to give a good estimate of the final number of decimal digits. This is also a good estimate a close power of 10. (#ops ~2)
    4. Iterate the powers_of_10 from that estimate until you find the largest power-of-10 less than your value. (#ops ~8)
      • If you're willing to sacrifice accuracy in the exponent, then this step can be skipped.
    5. Add that power of 10 to itself in a loop until greater than your input. (#ops < ~100) (10 additions of 10 uint64s)
      • If you're willing to sacrifice minor accuracy in the mantissa, then double(input)/double(power_of_ten) will do it in a single division.
      • There's lots of shortcuts that you can take in the bigint->double conversion.
    6. Emit loop_count-1 E power_of_ten_index. (#ops ~4)

    If you're willing to sacrifice accuracy in both exponent and mantissa, that leaves ~16 operations that completely ignore the low bits.

    Performance

    Without writing out a final implementation, performance is hard to guess at, and with a large LUT, cache, and therefore the rest of the program becomes a factor, but here's preliminary data: https://quick-bench.com/q/53k-xSQz7y4iCO7ny66Dz2w62dQ (I ran it several times to try to eliminate outliers)

    In my tests, the fastest combination seems to (unsurprisingly) be:

    From this baseline, we can see:

    It is also interesting to note that all of these are ~4x faster than converting the bigint to a double and then using printf("%.0E",value);

    (I do not vouch for the accuracy of the results of any of this code)