graphicsfloating-pointfixed-point

When converting floating point to fixed point why have I seen some code do BIT_WIDTH ^ 2 MINUS 1?


In my mind intuitively, and even after some thought, if I want to convert a normalised float value back to fixed point I would multiply it by the max value able to be held by the fixed-point format. So multiply by 255 or 65535, or whatever. However in some cases I have seen some people or code insist that the correct way is to multiply by 255 - 1, or 65535 - 1. I don't know why this is or why it should be the case. Is this to deal overflow? I don't see what bad could happen if you multiplied 1.0f * 255. Even if 1.0f can't be perfectly represented in IEEE 754 and the multiplication yields 255.0001, then when casting that to an integer it still ends up as 255, no overflow.

Edit: Sorry, there are two mistakes in my question. I mean 2 ^ bit_width - 1, not bit_width ^ 2 - 1. Which is presumably correct, so for a 8 bit integer you would multiply by 255, and 16 bit integer you would multiply by 65535. My question is, I'm pretty sure I've seen code that multiplies by 255 - 1, that is 254. Is there any reason to multiply by 254 instead of 255?


Solution

  • It is very unlikely you saw any code multiplying by BIT_WIDTH ^ 2 MINUS 1. For a width of, say, 16, this would multiply by 162−1 = 256−1 = 255, while the maximum value in a 16-bit unsigned integer is 65,535. It is more likely you saw code multiplying by 2 ^ BIT_WDITH - 1, which would produce 65,535.

    Consider an n-bit unsigned integer format. Its values range from 0 to 2n−1. Somebody mapping this to a floating-point interval [0, 1] might choose the mapping f: xx / (2n−1), as that maps 0 to 0, 2n−1 to 1, and every value between 0 and 2n−1 to a value between 0 and 1.

    Once that choice is made, the inverse mapping is f−1: yy•(2n−1).

    However in some cases I have seen some people or code insist that the correct way is to multiply by 255 - 1, or 65535 - 1.

    I see no reason for this. Show us the words of those people or the text of that code.

    If the reverse map is yy•(2n−2), then the forward map is xx / (2n−2). For n = 8, this would map 255 to 255/254 = 1.045…, which is not within what would usually be used for a normalized interval, [0, 1].

    One way this might make sense is if the fixed-point format reserved the value 2n−1 to denote some exceptional condition (such as that an error has occurred or data is missing), so its interval of numeric values were [0, 2n−2]. In that case, the maps would be xx / (yy•(2n−2) and yy•(2n−2).