deep-learning pytorch fairseq

Why is the output of VQ-Wav2Vec from FairSeq missing frames?


I am using the fairseq library to run the example code for feature extraction with VQ-Wav2Vec, as shown below:

In [6]: import torch
   ...: from fairseq.models.wav2vec import Wav2VecModel

In [7]: cp = torch.load('wav2vec_models/checkpoint_best.pt')
   ...: model = Wav2VecModel.build_model(cp['args'], task=None)
   ...: model.load_state_dict(cp['model'])
   ...: model.eval()

In [9]: wav_input_16khz = torch.randn(1,10000)
   ...: z = model.feature_extractor(wav_input_16khz)
   ...: f, idxs = model.vector_quantizer.forward_idx(z)
   ...: print(idxs.shape, f.shape)

>>>> torch.Size([1, 60, 4]) torch.Size([1, 512, 60])

My understanding is that vq-wav2vec processes the input speech (assumed to be sampled at 16K samples/sec) in 10 ms steps and outputs a feature vector of size 512 for each of these 10 ms steps. So given that the input speech is 10000 samples, I would expect 62 frames (62 * 160 = 9920 samples).

Why do I see only 60 frames?


Solution

  • From the article (arxiv.org/pdf/1904.05862.pdf): "The output of the encoder is a low frequency feature representation zi ∈ Z which encodes about 30 ms of 16 kHz of audio and the striding results in representations zi every 10ms." The windows overlap, which explains why you get 2 frames fewer: each output frame needs a full 30 ms window of input, and a 30 ms window moved in 10 ms steps fits only 60 positions inside your 10000-sample (625 ms) input, as the sketch below illustrates.
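
    To double-check the arithmetic, here is a minimal sketch (not fairseq code) that reproduces the frame count with the standard no-padding convolution length formula. The kernel sizes (10, 8, 4, 4, 4) and strides (5, 4, 2, 2, 2) are the values reported for the wav2vec encoder in the paper and are an assumption about this particular checkpoint; substitute your checkpoint's layer configuration if it differs.

    # Illustrative sketch: reproduce the 60-frame count from the conv geometry.
    # Kernel sizes / strides are the wav2vec paper values (an assumption here);
    # a given checkpoint may use a different encoder configuration.

    def conv_out_len(n_in, kernel, stride):
        # length of a 1-D convolution output with no padding
        return (n_in - kernel) // stride + 1

    kernels = [10, 8, 4, 4, 4]
    strides = [5, 4, 2, 2, 2]

    n = 10000  # input samples (625 ms at 16 kHz)
    for k, s in zip(kernels, strides):
        n = conv_out_len(n, k, s)
    print(n)  # 60

    # Equivalent sliding-window view: a ~30 ms (465-sample) receptive field
    # moved in 160-sample (10 ms) hops fits 60 positions in 10000 samples:
    print((10000 - 465) // 160 + 1)  # 60

    Both calculations land on 60, matching the frame dimension of the idxs and f tensors printed above.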