I'm new to voice recognition and I'm going through the details of this implementation of speaker verification. In data_preprocess.py, the authors use the librosa library. Here is a simplified version of the code:
    import os

    import librosa
    import numpy as np
    from tqdm import tqdm

    def preprocess_data(data_dir, res_dir, N, M, tdsv_frame, sample_rate, nfft, window_len, hop_len):
        os.makedirs(res_dir, exist_ok=True)
        batch_frames = N * M * tdsv_frame
        batch_number = 0
        batch = []
        batch_len = 0
        for i, path in enumerate(tqdm(os.listdir(data_dir))):
            data, sr = librosa.core.load(os.path.join(data_dir, path), sr=sample_rate)
            S = librosa.core.stft(y=data, n_fft=nfft, win_length=int(window_len * sample_rate), hop_length=int(hop_len * sample_rate))
            batch.append(S)
            batch_len += S.shape[1]
            # keep accumulating spectrograms until there are enough frames for one batch
            if batch_len < batch_frames: continue
            # join along the time axis and keep exactly batch_frames frames
            batch = np.concatenate(batch, axis=1)[:,:batch_frames]
            np.save(os.path.join(res_dir, "voice_%d.npy" % batch_number), batch)
            batch_number += 1
            batch = []
            batch_len = 0

    N = 2 # number of speakers of batch
    M = 400 # number of utterances per speaker
    tdsv_frame = 80 # feature size
    sample_rate = 8000 # sampling rate
    nfft = 512 # fft kernel size
    window_len = 0.025 # window length (ms)
    hop_len = 0.01 # hop size (ms)
    data_dir = "./data/clean_testset_wav/"
    res_dir = "./data/clean_testset_wav_prep/"
Based on a figure in the paper, they want to create a batch of features of size (N*M)*tdsv_frame.
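To make the batch size concrete, here is a minimal sketch of the concatenate-and-trim step using zero-filled stand-ins instead of real STFT output. The helper name trim_to_batch, the small values of N and M, and the per-file frame counts are hypothetical and chosen only for illustration; the number of frequency bins assumes a one-sided STFT with 1 + nfft // 2 rows.

```python
import numpy as np

def trim_to_batch(spectrograms, batch_frames):
    """Concatenate spectrograms along the time axis and keep the first batch_frames frames."""
    stacked = np.concatenate(spectrograms, axis=1)
    return stacked[:, :batch_frames]

# Small hypothetical values, not the ones used in the script above
N, M, tdsv_frame = 2, 3, 80
nfft = 512
n_bins = 1 + nfft // 2             # frequency bins of a one-sided STFT (assumption)
batch_frames = N * M * tdsv_frame  # total frames wanted in one batch

# Stand-ins for the STFT output of three files with different frame counts
fake_stfts = [np.zeros((n_bins, n), dtype=np.complex64) for n in (200, 150, 300)]

batch = trim_to_batch(fake_stfts, batch_frames)
print(batch.shape)  # (257, 480)
```

So, under these assumptions, the saved array has one row per frequency bin and N * M * tdsv_frame columns, one per frame.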
I think I understand the concepts of window_length and hop_length, but my question is how the authors set these parameters. Why should we multiply these lengths by sample_rate, as is done here:
    S = librosa.core.stft(y=data, n_fft=nfft, win_length=int(window_len * sample_rate), hop_length=int(hop_len * sample_rate))
Thank you.
librosa.core.stft takes win_length and hop_length in number of samples. This is typical in digital signal processing, since the systems are fundamentally discrete and defined by the number of samples per second (the sample rate).
However, for humans it is easier to think of these durations in seconds or milliseconds, as in your example:
    window_len = 0.025 # window length (ms)
    hop_len = 0.01 # hop size (ms)
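So to go from a time in seconds to a time in number of samples, one has to multiply by the sample rate. Note that, despite the (ms) in the comments, these values are in seconds: 0.025 s is 25 ms and 0.01 s is 10 ms. With sample_rate = 8000, that gives a window of 0.025 * 8000 = 200 samples and a hop of 0.01 * 8000 = 80 samples.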
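As a quick check of that conversion, here is a minimal sketch using the values from your example. The helper name seconds_to_samples is hypothetical; it mirrors the int(...) expression passed to stft in your code and does not call librosa.

```python
def seconds_to_samples(duration_s, sample_rate):
    """Convert a duration in seconds to a whole number of samples."""
    # int() truncates toward zero, matching the expression in the question
    return int(duration_s * sample_rate)

sample_rate = 8000   # samples per second
window_len = 0.025   # seconds (25 ms)
hop_len = 0.01       # seconds (10 ms)

win_length = seconds_to_samples(window_len, sample_rate)
hop_length = seconds_to_samples(hop_len, sample_rate)
print(win_length, hop_length)  # 200 80
```

Because int() truncates, a product that lands slightly below a whole number because of floating-point error would lose one sample; for other sample rates or durations, round() may be the safer choice.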