I am extracting MFCCs from an audio file using librosa's librosa.feature.mfcc function, and I correctly get back a numpy array with the shape I was expecting: 13 MFCC values for each of the 1292 windows that span the audio file (30 seconds).
What is missing is timing information for each window: for example I want to know what the MFCC looks like at time 5000ms, then at 5200ms etc. Do I have to manually calculate the time? Is there a way to automatically get the exact time for each window?
The timing information is not directly available in the result, as it depends on the sampling rate and the hop length. To attach it, librosa would have to create its own classes, which would clutter the interface and make it much less interoperable. In the current implementation, feature.mfcc returns a plain numpy.ndarray, meaning you can easily integrate this code anywhere in Python.
To relate MFCC to timing:
import librosa
import numpy as np
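# Note: example_audio_file() is deprecated since librosa 0.8; newer versions provide librosa.ex(...) instead
filename = librosa.util.example_audio_file()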
y, sr = librosa.load(filename)
hop_length = 512 # number of samples between successive frames
mfcc = librosa.feature.mfcc(y=y, n_mfcc=13, sr=sr, hop_length=hop_length)
audio_length = len(y) / sr # in seconds
step = hop_length / sr # in seconds
intervals_s = np.arange(start=0, stop=audio_length, step=step)
print(f'MFCC shape: {mfcc.shape}')
print(f'intervals_s shape: {intervals_s.shape}')
print(f'First 5 intervals: {intervals_s[:5]} second')
Note that the number of frames in mfcc (its second axis) equals the length of intervals_s, a sanity check that the calculation is consistent:
MFCC shape: (13, 2647)
intervals_s shape: (2647,)
First 5 intervals: [0. 0.02321995 0.04643991 0.06965986 0.09287982] second
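As a complement to the snippet above, the time of each frame can also be derived from the frame index alone, which avoids relying on np.arange with a floating-point step (its length can be off by one in edge cases). The sketch below is a minimal illustration, not librosa's implementation: frame_times and time_to_frame are hypothetical helper names, and it assumes frame i sits at sample i * hop_length, which matches the first intervals printed above. librosa itself ships frames_to_time, times_like and time_to_frames for these conversions.

```python
import numpy as np

def frame_times(n_frames, sr, hop_length):
    """Time in seconds of each frame, assuming frame i sits at sample i * hop_length."""
    return np.arange(n_frames) * hop_length / sr

def time_to_frame(t_seconds, sr, hop_length):
    """Index of the frame closest to t_seconds (hypothetical helper)."""
    return int(round(t_seconds * sr / hop_length))

sr = 22050        # default sampling rate used by librosa.load
hop_length = 512  # same hop length as in the snippet above
n_frames = 2647   # mfcc.shape[1] from the output above

times = frame_times(n_frames, sr, hop_length)
idx_5000ms = time_to_frame(5.0, sr, hop_length)
idx_5200ms = time_to_frame(5.2, sr, hop_length)

print(times[:3])               # times of the first three frames, in seconds
print(idx_5000ms, idx_5200ms)  # frame indices nearest to 5000 ms and 5200 ms
```

With the real array, mfcc[:, idx_5000ms] would then give the 13 coefficients of the frame nearest to 5000 ms.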