[SOLVED] What are the components of the Mel mfcc

In looking at the output of this line of code:

mfccs = librosa.feature.mfcc(y=librosa_audio, sr=librosa_sample_rate, n_mfcc=40)
print("MFCC Shape = ", mfccs.shape)

I get the output MFCC Shape = (40, 1876). What do these two numbers represent? I looked at the librosa documentation but still could not work out what these two values mean.

Any insights will be greatly appreciated!

Solution

The first dimension (40) is the number of MFCCs, and the second dimension (1876) is the number of time frames. The number of MFCCs is set by n_mfcc, and the number of time frames is approximately the length of the audio (in samples) divided by hop_length. With librosa's defaults (hop_length=512 and centered frames), the exact count is 1 + len(y) // hop_length.
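
As a rough check of the frame count, here is a minimal sketch in plain Python. The function name expected_num_frames is hypothetical (it is not part of librosa); the default values mirror librosa's defaults for hop_length, n_fft and center, and the 960000-sample input is just an illustrative length, not the asker's actual file.

```python
def expected_num_frames(num_samples, hop_length=512, n_fft=2048, center=True):
    """Estimate how many STFT frames a signal of num_samples samples produces.

    Hypothetical helper for illustration only; defaults mirror librosa's.
    """
    if center:
        # The signal is padded so that each frame is centered on a hop position.
        return 1 + num_samples // hop_length
    # Without centering, a frame must fit entirely inside the signal.
    return 1 + (num_samples - n_fft) // hop_length


# An illustrative clip of 960000 samples (about 43.5 s at 22050 Hz)
print(expected_num_frames(960000))
```

If the defaults were used, a second dimension of 1876 therefore corresponds to roughly 960,000 samples of audio.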

To understand the meaning of the MFCCs themselves, you should understand the steps it takes to compute them:

A spectrogram, computed with the Short-Time Fourier Transform (STFT)
The mel spectrogram, obtained by applying mel-scale filterbanks to the STFT spectrogram
Mel-Frequency Cepstral Coefficients, obtained by applying the Discrete Cosine Transform (DCT) to the log-scaled mel spectrogram
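
The three steps above can be sketched with NumPy alone. This is a simplified illustration, not librosa's implementation: the helper names (hz_to_mel, mel_filterbank, dct_ii, simple_mfcc) are made up for this example, and the mel formula, window, lack of padding and unnormalized DCT are assumptions that differ from librosa's defaults, so the values will not match librosa.feature.mfcc.

```python
import numpy as np


def hz_to_mel(f):
    # One common mel formula (assumed here; librosa defaults to a different variant)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose center frequencies are evenly spaced on the mel scale
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i:i + 3]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def dct_ii(x, n_out):
    # Unnormalized DCT-II along axis 0, keeping the first n_out coefficients
    n = x.shape[0]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ x


def simple_mfcc(y, sr, n_mfcc=13, n_fft=512, hop_length=256, n_mels=40):
    # Step 1: power spectrogram from a windowed STFT (no padding)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    window = np.hanning(n_fft)
    frames = np.stack(
        [y[i * hop_length:i * hop_length + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    power = np.abs(np.fft.rfft(frames, axis=0)) ** 2

    # Step 2: mel spectrogram, then log scaling (decibels)
    mel_spec = mel_filterbank(sr, n_fft, n_mels) @ power
    log_mel = 10.0 * np.log10(np.maximum(mel_spec, 1e-10))

    # Step 3: DCT along the mel axis gives the cepstral coefficients
    return dct_ii(log_mel, n_mfcc)


# One second of a 440 Hz tone at 16 kHz
sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
mfccs = simple_mfcc(y, sr)
print(mfccs.shape)  # (n_mfcc, number of time frames)
```

As with librosa, the result has one row per coefficient and one column per time frame.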

A good written explainer is Haytham Fayek's "Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What's In-Between", and a good video explainer is The Sound of AI's "Mel-Frequency Cepstral Coefficients Explained Easily".