machine-learning, pytorch, artificial-intelligence, text-to-speech

Understanding usage of HiFi-GAN by VITS


I'm (trying to) learn AI/ML for speech synthesis, and I'm trying to understand how HiFi-GAN is used by VITS.

From my understanding, VITS converts the text input into mel spectrograms, which are then converted to audio waveforms by HiFi-GAN.

What confuses me is why the input sent from VITS to HiFi-GAN is not a mel spectrogram.

For example, when I test other models and add the code below to the forward method of HiFi-GAN's Generator:

class Generator(torch.nn.Module):
    ...
    def forward(self, x):
        # dump the vocoder input to an image before running the usual forward pass
        plot_spectrogram(x[0].cpu().detach().numpy(), "mel_spec.png")
        ...
    ...

it saves the correct image, which looks like a mel spectrogram. However, when I do the same with VITS, the saved image is a plain green image, which is clearly not a representation of a mel spectrogram.

But the resulting audio file is still a valid audio file.

So could anyone explain that to me?
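
(For reference, plot_spectrogram here is just a small matplotlib helper I use to dump the tensor to an image file, roughly along these lines:)

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def plot_spectrogram(spectrogram, filename):
    """Save a 2D array (channels x frames) as an image file."""
    fig, ax = plt.subplots(figsize=(10, 4))
    im = ax.imshow(spectrogram, aspect="auto", origin="lower", interpolation="none")
    fig.colorbar(im, ax=ax)
    ax.set_xlabel("Frames")
    ax.set_ylabel("Channels")
    fig.tight_layout()
    fig.savefig(filename)
    plt.close(fig)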

I'm evaluating a few neural TTS models, and what I wanted to do is save the mel spectrograms created by the models so I can compare them later, and also run them through different vocoders to compare those as well.

I noticed that the HiFi-GAN code in the VITS repo is slightly different from the original repo, but I can't understand why.

Is there any way I can convert the input parameter x to a mel spectrogram representation without first converting it to audio and then converting the audio to a mel spectrogram?


Solution

  • In VITS, the input to the HiFi-GAN module is not a traditional mel spectrogram but rather the latent variables produced by the encoder, which is conditioned on the text and an alignment model. In this sense, VITS is an end-to-end model.

    The HiFi-GAN in VITS acts as a decoder that takes these latent variables and generates the final audio waveform directly. So what you are trying to plot is not a mel spectrogram but these latent variables.

    This approach allows VITS to maintain flexibility and high-quality synthesis without relying on a traditional intermediate representation such as a mel spectrogram.
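
    If you still want a mel spectrogram for comparing models, the practical route is to compute it from the waveform that the decoder outputs. Below is a minimal sketch using torchaudio; the analysis parameters (22050 Hz, n_fft 1024, hop 256, 80 mel bands) are assumptions that should be matched to the model's actual config, and the wav tensor in the commented usage stands for whatever your model's inference call returns.

    import torch
    import torchaudio

    # Assumed analysis parameters -- match them to the model's config
    # (22050 Hz / n_fft 1024 / hop 256 / 80 mels are common LJSpeech-style settings).
    mel_transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=22050,
        n_fft=1024,
        win_length=1024,
        hop_length=256,
        n_mels=80,
    )

    def waveform_to_mel(waveform: torch.Tensor) -> torch.Tensor:
        """waveform: (1, T) mono audio -> (1, n_mels, frames) log-mel spectrogram."""
        mel = mel_transform(waveform)
        return torch.log(torch.clamp(mel, min=1e-5))  # log-compress as most TTS pipelines do

    # Hypothetical usage: `wav` is a (1, 1, T) tensor returned by the model's
    # inference call; drop the channel dimension before the transform.
    # mel = waveform_to_mel(wav.squeeze(1).cpu())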