floating-point normalization wav aiff

Is it correct to assume that floating-point samples in a WAV or AIFF file will be normalized?


Say I have a program that reads a .WAV or .AIFF file, and the file's audio is encoded as floating-point sample-values. Is it correct for my program to assume that any well-formed (floating-point-based) .WAV or .AIFF file will contain sample values only in the range [-1.0f,+1.0f]? I couldn't find anything in the WAV or AIFF specifications that addresses this point.

And if that is not a valid assumption, how can one know what the full dynamic range of the audio in the file was intended to be? (I could read the entire file and find the file's actual minimum and maximum sample values, but there are two problems with that: (1) it would be a slow/expensive operation if the file is very large, and (2) it would lose information: if the file's creator had intended the file to have some "headroom" so as not to play at 0 dBFS at its loudest point, my program would not be able to detect that.)


Solution

  • As you state, the publicly available documentation does not go into detail about the range used for floating-point samples. However, based on industry practice over the last several years, and on the actual data found in existing floating-point files, I would say it is a valid assumption.

    There are practical reasons for this, and [-1, 1] is also a very common normalization range for high-precision data in general, be it color, audio, 3D, etc.

    The main reason for the range to be in the interval [-1, 1] is that it is fast and easy to scale/convert to the target bit-range. You only need to supply the target range and multiply.

    For example:

    If you want to play it at 16-bit you would do (pseudocode, assuming the result is rounded to a signed integer):

    sample = in < 0 ? in * 0x8000 : in * 0x7fff;
    

    or 24-bit:

    sample = in < 0 ? in * 0x800000 : in * 0x7fffff;
    

    or 8-bit:

    sample = in < 0 ? in * 0x80 : in * 0x7f;
    

    etc., without having to adjust the original input value in any way. -1 and 1 map directly to the min/max values of the target format (1 * x = x).
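    The one-liners above generalize to any bit depth. Here is a minimal Python sketch, assuming finite input values. The name `float_to_int` is hypothetical, and the final clamp is an added assumption (to cope with files that hold values outside [-1, 1]), not something the WAV or AIFF documentation prescribes:

```python
def float_to_int(sample: float, bits: int = 16) -> int:
    """Convert a float sample, nominally in [-1.0, 1.0], to a signed integer."""
    neg_scale = 1 << (bits - 1)   # e.g. 0x8000 for 16-bit
    pos_scale = neg_scale - 1     # e.g. 0x7fff for 16-bit

    # Same asymmetric scaling as the pseudocode: negative values use the
    # larger magnitude so that -1.0 maps to the most negative integer.
    scaled = sample * (neg_scale if sample < 0 else pos_scale)
    value = int(round(scaled))

    # Assumption: clamp out-of-range input instead of letting it overflow.
    return max(-neg_scale, min(pos_scale, value))
```

    For instance, `float_to_int(1.0)` yields 32767 and `float_to_int(-1.0, bits=8)` yields -128, matching the 0x7fff and 0x80 factors used above.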

    If you used a range of [-0.5, 0.5] you would first (or at some point) have to rescale the input value, so a conversion to, for example, 16-bit would need an extra step. This has an extra cost, not only for the step itself but also because the work happens in the floating-point domain, which is heavier to compute (the latter is perhaps a bit of a legacy reason, as floating-point processing is pretty fast nowadays, but in any case):

    in = in * 2;
    sample = in < 0 ? in * 0x8000 : in * 0x7fff;
    

    Keeping it in the [-1, 1] range rather than some pre-scaled range (for example [-32768, 32767]) also allows the use of more bits for precision (using the IEEE 754 representation).

    UPDATE 2017/07

    Tests

    Based on questions in comments I decided to triple-check by making a test using three files with a 1 second sine-wave:

    A) Floating point clipped
    B) Floating point max 0dB, and
    C) integer clipped (converted from A)

    The files were then scanned for values <= -1.0 and >= 1.0, starting right after the data chunk's ID and size field, so that the min/max values reflect the actual values found in the audio data.
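    A scan of this kind can be sketched in Python roughly as follows. This is not the original test script. It is a minimal sketch that assumes a plain RIFF/WAVE file with format tag 3 (IEEE float) and 32-bit samples, does not handle WAVE_FORMAT_EXTENSIBLE, and reads the whole data chunk into memory. The function name `float_wav_min_max` is hypothetical:

```python
import struct

WAVE_FORMAT_IEEE_FLOAT = 3  # format tag for IEEE float samples


def float_wav_min_max(path):
    """Return (min, max) over all samples of a 32-bit float WAV file."""
    with open(path, "rb") as f:
        header = f.read(12)
        if len(header) < 12:
            raise ValueError("file too short")
        riff, _size, wave_id = struct.unpack("<4sI4s", header)
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")

        fmt_tag = bits = None
        data = None
        while data is None:
            header = f.read(8)
            if len(header) < 8:
                raise ValueError("no data chunk found")
            chunk_id, chunk_size = struct.unpack("<4sI", header)
            if chunk_id == b"fmt ":
                fmt = f.read(chunk_size)
                # First 16 bytes: tag, channels, rate, byte rate, align, bits
                fmt_tag, _ch, _rate, _bps, _align, bits = struct.unpack(
                    "<HHIIHH", fmt[:16])
            elif chunk_id == b"data":
                # Samples start right after the chunk ID and size field.
                data = f.read(chunk_size)
            else:
                f.seek(chunk_size, 1)  # skip chunks we do not need
            if chunk_size % 2:
                f.seek(1, 1)  # RIFF chunks are padded to even sizes

    if fmt_tag != WAVE_FORMAT_IEEE_FLOAT or bits != 32:
        raise ValueError("expected 32-bit IEEE float samples")

    count = len(data) // 4
    if count == 0:
        raise ValueError("data chunk is empty")
    samples = struct.unpack("<%df" % count, data[:count * 4])
    return min(samples), max(samples)
```

    The result covers all channels together, since the samples are interleaved. A maximum above 1.0 or a minimum below -1.0 indicates that the file holds values beyond 0 dB.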

    The results confirm that the values are indeed within the inclusive [-1, 1] range when the signal is not clipping (peak level <= 0 dB).

    But they also revealed another aspect:

    WAV files saved as floating point do allow values exceeding the 0 dB range. This means the range is actually beyond [-1, 1] for values that normally would clip.

    A likely explanation is that floating-point formats are intended for intermediate use in production setups because they lose very little dynamic range: later processing (gain staging, compression, limiting, etc.) can bring the values back (without loss) well within the final, normal -0.2 to 0 dB range, so the format preserves the values as-is.

    In conclusion

    WAV files using floating point will store values in the [-1, 1] range when not clipping (<= 0 dB), but they do allow values that would be considered clipped.

    But when converted to an integer format, these values will clip to the equivalent of the [-1, 1] range scaled by the bit range of the integer format, regardless. This is natural given the limited range each bit width can hold.

    It will therefore be up to the player/DAW/editing software to handle clipped floating-point values, either by normalizing the data or by simply clipping back to [-1, 1].
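    The two options can be sketched in Python like this. It is a minimal illustration, not how any particular player or DAW does it. The function names `clip_samples` and `normalize_peak` are hypothetical, and the input is assumed to be a sequence of finite floats:

```python
def clip_samples(samples):
    """Hard-clip every sample back into [-1.0, 1.0] (information is lost)."""
    return [max(-1.0, min(1.0, s)) for s in samples]


def normalize_peak(samples):
    """Scale the whole signal down so that its peak lands exactly on 1.0.

    Signals that already fit in [-1.0, 1.0] are returned unchanged, so any
    intended headroom is preserved.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak <= 1.0:
        return list(samples)
    return [s / peak for s in samples]


# A peak of 2.0 is roughly +6 dB over full scale.
over_range = [0.5, 2.0, -1.0]
clipped = clip_samples(over_range)       # [0.5, 1.0, -1.0]
normalized = normalize_peak(over_range)  # [0.25, 1.0, -0.5]
```

    Clipping keeps the level of the in-range samples but distorts the peaks, while normalizing keeps the waveform shape but lowers the overall level.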

    [Image: file1]
    Notes: Max values for all files are measured directly from the sample data.

    [Image: file2]
    Notes: Produced as clipped float (+6 dB), then converted to signed 16-bit and back to float.

    [Image: file3]
    Notes: Clipped to +6 dB.

    [Image: file4]
    Notes: Clipped to +12 dB.

    Simple test script and files can be found here.