Tags: python, audio, pytorch, wav, torchaudio

Is it safe to truncate torchaudio's loaded 16-bit audio from `float32` to `float16`?


I have multiple WAV files with 16 bits of depth/precision. torchaudio.info(...) recognizes this, giving me:

precision = {int} 16

Yet when I use torchaudio.load(...), the resulting tensor has dtype float32. Given a tensor called audio, I know I can call audio.half() to truncate it to 16 bits and reduce my dataset's memory usage. But will that operation preserve the precision of every possible original value? I'm not lowering the dtype's precision below the original audio's bit depth, but there may be a good reason I'm unaware of for torchaudio still returning float32.
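
For reference, here is a minimal sketch of what I'm doing ("audio.wav" is a placeholder path; recent torchaudio versions report the bit depth as bits_per_sample, while older ones call it precision as above):

```python
import torchaudio

path = "audio.wav"  # placeholder; any of my 16-bit WAV files

info = torchaudio.info(path)
print(info.bits_per_sample)   # 16 (older torchaudio exposes this as `precision`)

audio, sample_rate = torchaudio.load(path)
print(audio.dtype)            # torch.float32, samples scaled to [-1.0, 1.0]

audio_fp16 = audio.half()     # the truncation in question
print(audio_fp16.dtype)       # torch.float16
```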


Solution

  • I would say it's returned as float32 because that is PyTorch's default dtype. Any model weights you create will be float32 as well, so if you convert the input data to float16, the inputs will be incompatible with the model. (Edit: or PyTorch will silently convert your data back to 32 bits anyway to make it compatible with the model. I'm not sure which PyTorch opts for, but TensorFlow definitely throws an error.)

    If you're looking to make small models, look at setting the default dtype to float16 before creating any models: https://pytorch.org/docs/stable/generated/torch.set_default_dtype.html (a minimal sketch of this follows at the end of this answer).

    HOWEVER, note that you will lose precision if you convert what is really a 16-bit integer (as you diagnosed, just stored as a 32-bit float) to a 16-bit float. A float16 spends 1 bit on the sign and 5 bits on the exponent, leaving only 10 stored bits for the significand (11 significant bits counting the implicit leading 1), so sample magnitudes above 2048 can no longer be represented exactly. The round-trip check at the end of this answer demonstrates the loss.

    I would just keep it at float32 if you're not particularly memory-constrained.
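
    A minimal sketch of the default-dtype suggestion above; whether float16 is accepted here depends on your PyTorch version (the linked docs historically only guaranteed float32 and float64), and the Linear module is just an illustration:

```python
import torch

# Must run before any models are created; anything built without an
# explicit dtype will now default to float16.
torch.set_default_dtype(torch.float16)

model = torch.nn.Linear(4, 2)   # illustrative module
print(model.weight.dtype)       # torch.float16
print(torch.zeros(3).dtype)     # plain tensors default to float16 too
```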
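
    And on the precision point, a quick self-contained check (no audio file needed): enumerate every possible 16-bit sample value, normalize it by 1/32768 the way torchaudio's float output does for 16-bit PCM, and count how many values change after a round trip through float16.

```python
import torch

# Every possible 16-bit integer sample value, scaled to [-1.0, 1.0) by 1/32768.
# float32 (24 significant bits) represents all of these exactly.
samples = torch.arange(-32768, 32768, dtype=torch.float32) / 32768.0

# Round-trip through float16 and count the values that no longer match.
roundtrip = samples.half().float()
changed = (roundtrip != samples).sum().item()
print(f"{changed} / {samples.numel()} sample values are altered by float16")

# float16 keeps only 11 significant bits (10 stored + 1 implicit), so any
# sample whose integer magnitude needs more than 11 bits gets rounded.
```

    Any nonzero count means float16 is genuinely lossy for 16-bit audio, despite both formats being "16 bits".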