how to calculate the timeline of an audio file after extracting MFCC features using python_speech_features
The idea is to get the timeline of the MFCC feature frames
import librosa
import numpy as np
import python_speech_features

audio_file = r'sample.wav'
samples, sample_rate = librosa.core.load(audio_file, sr=16000, mono=True)
timeline = np.arange(0, len(samples)) / sample_rate  # timeline of sample.wav in seconds
print(timeline)
mfcc_feat = python_speech_features.mfcc(samples, sample_rate)
python_speech_features.mfcc(...) takes several additional arguments. One of them is winstep, which specifies the amount of time between successive feature frames, i.e., MFCC frames. The default value is 0.01 s = 10 ms. In other contexts, e.g. librosa, this is also known as hop_length, which is then specified in samples rather than seconds.
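As a quick sanity check of that equivalence (assuming the 16 kHz sampling rate used in the question), converting between the two conventions is just a multiplication by the sample rate:

```python
sample_rate = 16000   # Hz, as used in the question
winstep = 0.01        # seconds between frames (python_speech_features convention)

# librosa expresses the same step size in samples instead of seconds:
hop_length = int(round(winstep * sample_rate))
print(hop_length)  # 160 samples per hop
```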
To find your timeline, you have to figure out the number of frames and the frame rate. With winstep=0.01, your frames per second (your feature or frame rate) is 100 Hz. The number of frames you have is len(mfcc_feat).
So you'd end up with:
import librosa
import python_speech_features
import numpy as np
audio_file = r'sample.wav'
samples, sample_rate = librosa.core.load(audio_file, sr=16000, mono=True)
timeline = np.arange(0, len(samples)) / sample_rate  # timeline of sample.wav in seconds
print(timeline)
winstep = 0.01 # happens to be the default value
mfcc_feat = python_speech_features.mfcc(samples, sample_rate, winstep=winstep)
frame_rate = 1./winstep
timeline_mfcc = np.arange(0, len(mfcc_feat))/frame_rate
print(timeline_mfcc)
Since a "frame" represents a duration of 0.01 s, you might want to shift each timestamp to the center of its frame, i.e., by 0.005 s.
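A minimal sketch of that centering shift (the frame count here is hypothetical; in practice you would use len(mfcc_feat) as in the code above):

```python
import numpy as np

winstep = 0.01            # seconds between frames
num_frames = 5            # hypothetical; use len(mfcc_feat) in practice
frame_rate = 1.0 / winstep

# frame start times, as in the answer above
timeline_mfcc = np.arange(num_frames) / frame_rate
# shift each timestamp by half a step to land on the frame center
timeline_centered = timeline_mfcc + winstep / 2
print(timeline_centered)  # [0.005 0.015 0.025 0.035 0.045]
```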