python-3.x audio mfcc

Python Librosa: What is the default frame size used to compute the MFCC features?


Using the Librosa library, I generated the MFCC features of a 1319-second audio file as a 20 x 56829 matrix. The 20 is the number of MFCC features (which I can adjust manually), but I don't know how the audio was segmented into 56829 frames. What frame size does it use to process the audio?

import numpy as np
import matplotlib.pyplot as plt
import librosa

def getPathToGroundtruth(episode):
    """Return path to groundtruth file for episode"""
    pathToGroundtruth = "../../../season01/Audio/" \
                        + "Season01.Episode%02d.en.wav" % episode
    return pathToGroundtruth

def getduration(episode):
    pathToAudioFile = getPathToGroundtruth(episode)
    y, sr = librosa.load(pathToAudioFile)
    duration = librosa.get_duration(y=y, sr=sr)
    return duration

def getMFCC(episode):
    filename = getPathToGroundtruth(episode)
    y, sr = librosa.load(filename)  # y: audio time series; sr: sampling rate (22050 Hz by default)
    data = librosa.feature.mfcc(y=y, sr=sr)
    return data


data = getMFCC(1)

Solution

  • Short Answer

    You can change the number of frames by changing the parameters used in the STFT calculation. The following code will roughly double the length of your output (about 20 x 113658), because it halves hop_length from the default of 512 to 256:

    data = librosa.feature.mfcc(y=y, sr=sr, n_fft=1012, hop_length=256, n_mfcc=20)
    

  • Long Answer

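    Librosa's librosa.feature.mfcc() function is largely a wrapper around librosa.feature.melspectrogram() (which is itself a wrapper around the librosa.core.stft and librosa.filters.mel functions). After computing the mel spectrogram, it converts the result to decibels and applies a discrete cosine transform.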

    All of the parameters pertaining to segmentation of the audio signal, namely the frame and overlap (hop) values, are used in the Mel-scaled power spectrogram function (other tunable parameters are passed on to the nested core functions). You specify these parameters as keyword arguments to librosa.feature.mfcc().

    All extra **kwargs parameters are fed to librosa.feature.melspectrogram() and subsequently to librosa.filters.mel().

    By default, the Mel-scaled power spectrogram window and hop length are the following:

    n_fft=2048

    hop_length=512
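To put those defaults in time units, the sketch below converts them to milliseconds at the default sample rate of 22050 Hz. The helper name `samples_to_ms` is made up for illustration and is not part of librosa.

```python
def samples_to_ms(n_samples, sr=22050):
    """Convert a length in samples to milliseconds at sampling rate sr."""
    return 1000.0 * n_samples / sr

n_fft = 2048       # default frame (window) length in samples
hop_length = 512   # default step between successive frames in samples

frame_ms = samples_to_ms(n_fft)
hop_ms = samples_to_ms(hop_length)
overlap = 1 - hop_length / n_fft  # fraction of each frame shared with the next

print(f"frame: {frame_ms:.1f} ms, hop: {hop_ms:.1f} ms, overlap: {overlap:.0%}")
```

So with the defaults, each frame covers roughly 93 ms of audio and a new frame starts about every 23 ms.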

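    So, assuming you used the default sample rate (sr=22050), the output of your mfcc function makes sense:

    output length ≈ (seconds) * (sample rate) / (hop_length)

    (1319) * (22050) / (512) ≈ 56804 frames

    The small gap to the observed 56829 frames is what you would expect if the file is slightly longer than the rounded 1319 seconds (roughly half a second more).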
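The arithmetic above can be checked with a small sketch. It assumes librosa's default centered framing (center=True), where the number of frames is 1 + floor(n_samples / hop_length). The function names are hypothetical helpers, not librosa functions.

```python
def expected_frames(n_samples, hop_length=512):
    """Frame count for a centered STFT: one frame per hop, plus the first frame."""
    return 1 + n_samples // hop_length

def frames_to_seconds(n_frames, sr=22050, hop_length=512):
    """Approximate audio duration implied by a given number of frames."""
    return (n_frames - 1) * hop_length / sr

sr = 22050
n_samples = 1319 * sr  # assumes the file is exactly 1319 s long

print(expected_frames(n_samples))                  # default hop of 512
print(expected_frames(n_samples, hop_length=256))  # halved hop, about twice as many frames
print(round(frames_to_seconds(56829), 2))          # duration implied by the observed 56829 frames
```

Under these assumptions, the observed 56829 frames correspond to a file of about 1319.5 seconds, which is consistent with the duration having been rounded down to 1319.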

    The parameters that you are able to tune are the following:

    Melspectrogram Parameters
    -------------------------
    y : np.ndarray [shape=(n,)] or None
        audio time-series
    
    sr : number > 0 [scalar]
        sampling rate of `y`
    
    S : np.ndarray [shape=(d, t)]
        power spectrogram
    
    n_fft : int > 0 [scalar]
        length of the FFT window
    
    hop_length : int > 0 [scalar]
        number of samples between successive frames.
        See `librosa.core.stft`
    
    kwargs : additional keyword arguments
      Mel filter bank parameters.
      See `librosa.filters.mel` for details.
    

    If you want to further specify characteristics of the mel filterbank used to define the Mel-scaled power spectrogram, you can tune the following:

    Mel Frequency Parameters
    ------------------------
    sr        : number > 0 [scalar]
        sampling rate of the incoming signal
    
    n_fft     : int > 0 [scalar]
        number of FFT components
    
    n_mels    : int > 0 [scalar]
        number of Mel bands to generate
    
    fmin      : float >= 0 [scalar]
        lowest frequency (in Hz)
    
    fmax      : float >= 0 [scalar]
        highest frequency (in Hz).
        If `None`, use `fmax = sr / 2.0`
    
    htk       : bool [scalar]
        use HTK formula instead of Slaney
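For a sense of what the htk flag refers to, the HTK-style mapping from Hz to mels is commonly written as 2595 * log10(1 + f / 700). The sketch below implements only that textbook formula. The function names are made up for illustration and are not librosa's own, and the Slaney variant is not shown.

```python
import math

def hz_to_mel_htk(freq_hz):
    """HTK-style Hz-to-mel conversion (textbook formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Round-trip a few frequencies, including the default fmax = sr / 2 for sr = 22050
for f in (0.0, 1000.0, 11025.0):
    m = hz_to_mel_htk(f)
    print(f"{f:8.1f} Hz -> {m:7.1f} mel -> {mel_to_hz_htk(m):8.1f} Hz")
```

The scale is roughly linear at low frequencies and logarithmic at high ones, which is why the mel bands get wider as frequency increases.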
    

    Documentation for Librosa:

    librosa.feature.melspectrogram

    librosa.filters.mel

    librosa.core.stft