Tags: conv-neural-network, mfcc

Using MFCCs and Mel-Spectrograms with CNN


I would like some feedback on why, in so many research papers, researchers pass MFCCs through a Convolutional Neural Network (CNN). Inherently, the CNN is itself a feature-extraction process.

Any tips or advice on why this process is so commonly used would be appreciated.

Thanks!


Solution

  • MFCCs mimic the non-linear way the human ear perceives sound, approximating the human auditory system's response. This is why MFCCs are widely used in speech recognition.

    While CNNs are used for feature extraction, raw audio signals are not commonly fed into them directly. The reason is that audio signals are inherently prone to noise and are often contaminated with frequency bands that are not useful for the intended application. It is therefore common practice to preprocess the signal to remove noise and irrelevant frequency bands (e.g. with bandpass filters), and then extract relevant features from it (see the first sketch after this answer). The features can be time-domain features, such as amplitude envelope, root-mean-square energy, or zero-crossing rate; frequency-domain features, such as band energy ratio, spectral centroid, and spectral flux; or time-frequency representations, such as the spectrogram and mel-spectrogram.

    CNNs are then used to extract local patterns from these extracted features. In particular, for time-frequency representations, 2D CNNs are used to extract features, much like the feature-extraction process in image recognition (see the CNN sketch below).
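
Here is a minimal sketch of that preprocessing pipeline using librosa and scipy. The file name, sampling rate, and the 300-3400 Hz passband are placeholder choices for illustration, not values from the answer:

```python
import numpy as np
import librosa
from scipy.signal import butter, filtfilt

# Placeholder file and sampling rate
y, sr = librosa.load("speech.wav", sr=16000)

# Band-pass filter to drop frequency bands irrelevant to the application
# (300-3400 Hz is a hypothetical speech band, chosen for illustration)
b, a = butter(4, [300, 3400], btype="bandpass", fs=sr)
y = filtfilt(b, a, y)

# Time-frequency representations commonly fed to 2D CNNs
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)      # shape: (64, n_frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
```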
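And a minimal 2D-CNN sketch (PyTorch, with hypothetical layer sizes and a placeholder class count) that treats the log-mel spectrogram as a one-channel image, learning local time-frequency patterns exactly as a CNN would learn local patterns in an image:

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes=10):  # n_classes is a placeholder
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # local time-frequency patterns
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # makes the net agnostic to clip length
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):  # x: (batch, 1, n_mels, n_frames)
        h = self.features(x).flatten(1)
        return self.classifier(h)

model = SpectrogramCNN()
dummy = torch.randn(1, 1, 64, 100)  # e.g. 64 mel bands x 100 frames
print(model(dummy).shape)           # torch.Size([1, 10])
```

The adaptive pooling at the end is one common way to handle variable-length audio clips, since the number of spectrogram frames differs from recording to recording.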