
How to calculate the timeline of an audio file after extracting MFCC features


How do I calculate the timeline of an audio file after extracting MFCC features using python_speech_features?

The idea is to get the time, in seconds, that corresponds to each MFCC frame.

import librosa
import numpy as np
import python_speech_features

audio_file = r'sample.wav'

samples, sample_rate = librosa.load(audio_file, sr=16000, mono=True)

timeline = np.arange(0, len(samples)) / sample_rate  # time in seconds of each audio sample

print(timeline)

mfcc_feat = python_speech_features.mfcc(samples, sample_rate)

Solution

  • python_speech_features.mfcc(...) takes several additional arguments. One of them is winstep, which specifies the time in seconds between the starts of successive feature frames, i.e., MFCC vectors. The default value is 0.01 s = 10 ms. In other contexts, e.g. librosa, this is known as hop_length and is specified in samples rather than seconds.

    To find your timeline, you have to figure out the number of features and the feature rate. With winstep=0.01, your features/second (your feature or frame rate) is 100 Hz. The number of frames you have is len(mfcc_feat).

    So you'd end up with:

    import librosa
    import python_speech_features
    import numpy as np
    
    audio_file = r'sample.wav'
    
    samples, sample_rate = librosa.load(audio_file, sr=16000, mono=True)
    
    timeline = np.arange(0, len(samples)) / sample_rate  # time in seconds of each audio sample
    
    print(timeline)
    
    winstep = 0.01  # happens to be the default value
    mfcc_feat = python_speech_features.mfcc(samples, sample_rate, winstep=winstep)
    
    frame_rate = 1./winstep
    
    timeline_mfcc = np.arange(0, len(mfcc_feat))/frame_rate
    print(timeline_mfcc)
    

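    Since each frame advances the timeline by 0.01 s, you might want to move the offset to the center of that step, i.e., by 0.005 s. Note that the analysis window itself is winlen seconds long (0.025 s by default), so if you want the center of the window instead, the offset is winlen/2 = 0.0125 s.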
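
The offset choice can be sketched as a small helper. This is a minimal sketch, assuming that frame i starts at i * winstep and that the analysis window is winlen seconds long (0.025 s being the python_speech_features default). The names frame_times and winstep_to_hop_length are hypothetical and not part of either library. It uses plain Python lists, so wrap the result in np.asarray(...) if you need an array.

```python
def frame_times(num_frames, winstep=0.01, winlen=0.025, align="start"):
    """Return one timestamp in seconds per feature frame.

    Assumption: frame i starts at i * winstep and spans winlen seconds.
    align="start"  -> time at which each frame begins
    align="center" -> time at the middle of each analysis window
    """
    if align == "start":
        offset = 0.0
    elif align == "center":
        offset = winlen / 2.0
    else:
        raise ValueError("align must be 'start' or 'center'")
    return [i * winstep + offset for i in range(num_frames)]


def winstep_to_hop_length(winstep, sample_rate):
    """Convert a step in seconds to a hop length in samples."""
    return int(round(winstep * sample_rate))
```

With the code above you would call frame_times(len(mfcc_feat), winstep=winstep) to get the frame start times, or pass align="center" to shift every timestamp to the middle of its window. winstep_to_hop_length(0.01, 16000) gives the equivalent hop in samples if you want to compare against a librosa-based pipeline.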