voice-recognition, voice, librosa, sample-rate

Why are win_length/hop_length multiplied by the sample rate in librosa.core.stft in this example?


I'm new to voice recognition and I'm going through the details of this implementation of speaker verification. In data_preprocess.py the authors use the librosa library. Here is a simplified version of the code:

import os

import librosa
import numpy as np
from tqdm import tqdm

def preprocess_data(data_dir, res_dir, N, M, tdsv_frame, sample_rate, nfft, window_len, hop_len):
    os.makedirs(res_dir, exist_ok=True)
    batch_frames = N * M * tdsv_frame   # total number of STFT frames per saved batch
    batch_number = 0
    batch = []
    batch_len = 0
    for i, path in enumerate(tqdm(os.listdir(data_dir))):
        # Load the utterance, resampling it to the target sample rate.
        data, sr = librosa.core.load(os.path.join(data_dir, path), sr=sample_rate)
        # Convert window/hop lengths from seconds to samples for the STFT.
        S = librosa.core.stft(y=data, n_fft=nfft,
                              win_length=int(window_len * sample_rate),
                              hop_length=int(hop_len * sample_rate))
        batch.append(S)
        batch_len += S.shape[1]
        if batch_len < batch_frames:
            continue
        # Concatenate along the time axis and truncate to exactly batch_frames frames.
        batch = np.concatenate(batch, axis=1)[:, :batch_frames]
        np.save(os.path.join(res_dir, "voice_%d.npy" % batch_number), batch)
        batch_number += 1
        batch = []
        batch_len = 0


N = 2               # number of speakers per batch
M = 400             # number of utterances per speaker
tdsv_frame = 80     # feature size (number of STFT frames)
sample_rate = 8000  # sampling rate (Hz)
nfft = 512          # FFT size
window_len = 0.025  # window length in seconds (25 ms)
hop_len = 0.01      # hop size in seconds (10 ms)
data_dir = "./data/clean_testset_wav/"
res_dir = "./data/clean_testset_wav_prep/"

Based on a figure in the paper, they want to create a batch of features of size (N*M) x tdsv_frame.
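
If I'm reading the code right, with these values one saved batch covers batch_frames = N * M * tdsv_frame = 2 * 400 * 80 = 64000 STFT frames, so each voice_%d.npy array should have shape (nfft // 2 + 1, batch_frames) = (257, 64000).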

I think I understand the concept of win_length and hop_length, but what I don't understand is how the authors set these parameters. Why should we multiply these lengths by sample_rate, as is done here:

S = librosa.core.stft(y=data, n_fft=nfft, win_length=int(window_len * sample_rate), hop_length=int(hop_len * sample_rate))

Thank you.


Solution

  • librosa.core.stft takes win_length/hop_length as a number of samples. This is typical in digital signal processing, because the systems are fundamentally discrete: a signal is just a sequence of samples taken at a fixed rate (the sample rate, in samples per second).

    However, for humans it is easier to think of these durations in seconds or milliseconds, as in your example:

    window_len = 0.025  # window length in seconds (25 ms)
    hop_len = 0.01      # hop size in seconds (10 ms)
    

    So to go from a duration in seconds to a length in samples, you multiply by the sample rate.
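
    With the values from the question (a quick sketch, not from the repo, just the arithmetic spelled out):

    sample_rate = 8000                       # samples per second
    win_length = int(0.025 * sample_rate)    # 0.025 s * 8000 samples/s = 200 samples
    hop_length = int(0.01 * sample_rate)     # 0.010 s * 8000 samples/s =  80 samples

    # librosa also provides a helper that performs the same conversion:
    import librosa
    win_length = librosa.time_to_samples(0.025, sr=sample_rate)   # 200
    hop_length = librosa.time_to_samples(0.01, sr=sample_rate)    # 80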