python normalization scaling spectrogram frequency-analysis

Normalize a melspectrogram to (0, 255) with or without frequency scaling


I am converting multiple log-mel spectrograms from .wav files to images. I want to destroy as little information as possible, since I plan to use the resulting images for a computer vision task. To convert the data to an image format, I currently use a simple sklearn.preprocessing.MinMaxScaler((0, 255)). To fit this scaler, I use the minimal and the maximal energy over all frequencies of all my spectrograms.

Should I scale my spectrograms with minimal and maximal energy for each specific frequency?

Does it make sense to have different frequencies with different scaling features?


Solution

  • Spectrograms are tricky to use as input to computer vision algorithms, especially neural networks, because their values follow a skewed, non-normal distribution. To tackle this you should:

    1. Normalize the input: transform the values either with a simple log(1 + x) (first option) or a Box-Cox transformation (second option). Both expand low values and compress high ones, making the distribution more Gaussian.
    2. Then bring the transformed values into an interval suitable for your use case. For CNNs a MinMaxScaler should be good enough, but change the interval to [0, 1], i.e. sklearn.preprocessing.MinMaxScaler((0, 1)). For classic computer vision, this could be sklearn.preprocessing.MinMaxScaler((0, 255)).
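
    The two steps above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it uses plain NumPy instead of the scikit-learn scaler, covers only the log(1 + x) option, and the function names and the random stand-in data are assumptions made for the example.

```python
import numpy as np


def spectrogram_to_unit_range(power_spec, lo=None, hi=None):
    """Compress with log(1 + x), then min-max scale into [0, 1].

    lo / hi are the min and max of the log-compressed values, ideally
    computed once over the whole dataset. If omitted, they are taken
    from this spectrogram alone.
    """
    compressed = np.log1p(power_spec)  # expects non-negative input
    lo = compressed.min() if lo is None else lo
    hi = compressed.max() if hi is None else hi
    span = hi - lo
    if span == 0:
        return np.zeros_like(compressed)  # constant input: nothing to scale
    # Clip in case dataset-wide lo / hi do not cover this spectrogram.
    return np.clip((compressed - lo) / span, 0.0, 1.0)


def to_uint8_image(unit_spec):
    """Map values in [0, 1] to 8-bit pixel intensities in [0, 255]."""
    return np.round(unit_spec * 255).astype(np.uint8)


# Hypothetical stand-in for a mel power spectrogram: (n_mels, n_frames).
rng = np.random.default_rng(0)
power_spec = rng.gamma(shape=0.5, scale=10.0, size=(64, 100))

unit_spec = spectrogram_to_unit_range(power_spec)  # for a CNN
image = to_uint8_image(unit_spec)                  # for an image file
```

    Keeping the [0, 1] float array for the network and quantizing to 8 bits only when an image file is actually needed avoids an unnecessary loss of precision.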

    So,

    Should I scale my spectrograms with minimal and maximal energy for each specific frequency?

    Yes, once the normalization is done.

    and

    Does it make sense to have different frequencies with different scaling features?

    It depends. For CNNs, your input data needs to be scaled consistently to get good results. For classic computer vision approaches it can make sense, depending on what you want to do with the images.
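
    To make the difference between the two scaling choices concrete, here is a small sketch comparing a single global min/max with a separate min/max per frequency bin. The helper names and the synthetic log-mel data are assumptions for illustration only. In both cases the statistics are fitted once over all spectrograms and then reused, so every image is scaled the same way.

```python
import numpy as np


def fit_global(specs):
    """One min/max pair over every bin of every spectrogram."""
    stacked = np.concatenate(specs, axis=1)  # (n_mels, total_frames)
    return stacked.min(), stacked.max()


def fit_per_frequency(specs):
    """One min/max pair per mel bin, each with shape (n_mels, 1)."""
    stacked = np.concatenate(specs, axis=1)
    return (stacked.min(axis=1, keepdims=True),
            stacked.max(axis=1, keepdims=True))


def scale(spec, lo, hi, out_max=255.0):
    """Min-max scale into [0, out_max]; lo / hi broadcast over frames."""
    span = np.where(hi - lo == 0, 1.0, hi - lo)  # guard constant bins
    return (spec - lo) / span * out_max


# Synthetic log-mel data: 4 bins, each 20 dB quieter than the previous one.
rng = np.random.default_rng(0)
offsets = np.array([[0.0], [-20.0], [-40.0], [-60.0]])
specs = [offsets + rng.normal(0.0, 3.0, size=(4, n)) for n in (50, 70)]

g_lo, g_hi = fit_global(specs)
f_lo, f_hi = fit_per_frequency(specs)

global_img = scale(specs[0], g_lo, g_hi)    # keeps level differences between bins
per_freq_img = scale(specs[0], f_lo, f_hi)  # each bin uses the full range
```

    With global scaling the quiet bins stay dark, so the relative energy between frequencies is preserved. With per-frequency scaling every bin is stretched to the full range, which brings out detail in quiet bands but discards how loud the bands are relative to each other.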