ruby floating-point decimal

Correct Decimal to Floating-Point using Big Integers in Ruby


I'm trying to implement, for learning purposes, the core part of an algorithm that converts decimal strings to 64-bit floating-point numbers.

I'm using the explanation from this page as a guide: https://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers

I'm doing this in Ruby because Ruby's Integer type is implemented as big integers.

This is what I have so far:

# frozen_string_literal: true

def create_float sign, exponent, mantissa
  v = String.new
  v << sign.to_s(2)
  v << exponent.to_s(2).rjust(11, '0')
  v << mantissa.to_s(2).rjust(52, '0')
  [v].pack('B*').unpack1('G')
end

def get_scale value
  scale = 0
  if value >= (2**53)
    while value >= (2**53)
      value /= 2
      scale += 1
    end
  else
    while value < (2**52)
      value *= 2
      scale += 1
    end
  end
  scale
end

def to_double value, exponent
  if exponent >= 0
    e = (10**exponent)
    t = value * e
    s = get_scale t
    q = t.div(2**s)
    r = t - q * 2**s
    z = (52 + s) + 1023
  else
    e = (10**-exponent)
    s = get_scale value.div(e)
    t = value * (2**s)
    q = t / e
    r = t - q * e
    z = (52 + -s) + 1023
  end

  h = e / 2
  q += 1 if r > h || r == h && q.odd?
  m = q - (2**52)

  puts "T: #{t}"
  puts "S: #{s}"
  puts "Q: #{q}"
  puts "R: #{r}"
  puts "H: #{h}"
  puts "Z: #{z}"
  puts "M: #{m}"

  create_float 0, z, m
end

# Expected: 1.7976931348623157e+308
#   Actual: 1.348269851146737e+308
puts to_double(17_976_931_348_623_158, 292)

The algorithm works fine for the numbers used as examples on the page (3.14159 and 1.2345678901234567e22) but fails for 1.7976931348623158e308.

I think my problem may have to do with the rounding part. A q of 9007199254740992 fails, but a q of 9007199254740991 gives me the correct answer.


Solution

  • Your error is here:

    h = e / 2
    

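    h is meant to be half of the denominator that produced q and r. In the exponent >= 0 branch that denominator is 2**s, not e, so there it should be

    h = 2**s / 2

    In the exponent < 0 branch the denominator is e, so h = e / 2 is already correct there.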
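As a rough check of the point about the denominator, here is a minimal sketch of the rounding step for the exponent >= 0 case. The helper name round_quotient is hypothetical (it is not part of the question's code), and s is computed with Integer#bit_length on the assumption that it matches what get_scale returns for this input.

```ruby
# Hypothetical helper (not from the question): divide t by 2**s and
# round the quotient half to even, using half of that same denominator.
def round_quotient(t, s)
  denominator = 2**s
  q, r = t.divmod(denominator)
  h = denominator / 2 # half of the denominator actually used for q and r
  q += 1 if r > h || (r == h && q.odd?)
  q
end

t = 17_976_931_348_623_158 * 10**292
s = t.bit_length - 53 # assumed to equal get_scale(t) for this t
q = round_quotient(t, s)
z = 52 + s + 1023     # biased exponent, as in the question
m = q - 2**52         # drop the implicit leading bit

# Assemble sign (1 bit), exponent (11 bits), and mantissa (52 bits).
# Note: this sketch does not handle a round-up that carries q to 2**53.
bits = '0' + z.to_s(2).rjust(11, '0') + m.to_s(2).rjust(52, '0')
result = [bits].pack('B*').unpack1('G')

puts q      # 9007199254740991
puts result # 1.7976931348623157e+308
```

With h taken from 2**s the remainder is below the halfway point, so q stays at 9007199254740991 and fits in the 52-bit mantissa field after the leading bit is removed.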
    

    A few other notes (not exhaustive, and meant to be helpful rather than critical):

    q = t.div(2**s)
    r = t - q * 2**s
    

    This can be simplified to quotient, remainder = t.divmod(2 ** s), because divmod returns an Array of [quotient, modulus].
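A small illustration of that equivalence, using arbitrary sample values for t and s that are chosen here only for the example:

```ruby
t = 1_000_003 # arbitrary sample value
s = 4         # arbitrary sample scale, so the divisor is 2**4 = 16

# Two-step form from the question
q_long = t.div(2**s)
r_long = t - q_long * 2**s

# Single call: divmod returns [quotient, modulus]
q, r = t.divmod(2**s)

puts q                           # 62500
puts r                           # 3
puts [q, r] == [q_long, r_long]  # true
```

For non-negative t and a positive divisor, as in this algorithm, the modulus returned by divmod is the same as the remainder computed by hand.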