I'm trying to learn AI/ML for speech synthesis and to understand how HiFi-GAN is used by VITS.
From my understanding, VITS converts the text input into mel spectrograms, which are then converted to audio waveforms by HiFi-GAN.
What confuses me is why the input sent from VITS to HiFi-GAN is not a mel spectrogram.
For example, when I test other models and add the code below to the forward method of HiFi-GAN's Generator:
class Generator(torch.nn.Module):
    ...
    def forward(self, x):
        plot_spectrogram(x[0].cpu().detach().numpy(), "mel_spec.png")
        ...
    ...
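(For reference, my plot_spectrogram is roughly the helper from HiFi-GAN's utils.py, just modified to save the figure to a file; a simplified version looks like this:)

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def plot_spectrogram(spectrogram, filename):
    # spectrogram: 2D numpy array of shape [n_mels, frames]
    fig, ax = plt.subplots(figsize=(10, 2))
    im = ax.imshow(spectrogram, aspect="auto", origin="lower", interpolation="none")
    plt.colorbar(im, ax=ax)
    fig.savefig(filename)
    plt.close(fig)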
it saves the correct image, which looks like a mel spectrogram. However, when I do the same with VITS, the saved image is a plain green image, which is of course not a representation of a mel spectrogram.
Yet the resulting audio file is a perfectly valid audio file.
So could anyone explain that to me?
I'm evaluating a few neural TTS models, and what I want to do is save the mel spectrograms created by the models so I can compare them later, and also run them through different vocoders to compare those as well.
I noticed that the HiFi-GAN code in the VITS repo is slightly different from the original repo, but I can't understand why.
Is there any way I can convert the input parameter x to a mel spectrogram representation without first converting it to audio and then converting the audio to a mel spectrogram?
In VITS, the input to the HiFi-GAN module is not a traditional mel-spectrogram, but rather latent variables produced by the encoder, which is conditioned on the text and an alignment model. In this sense, one can say that VITS is an end-to-end model.
The HiFi-GAN in VITS acts as a decoder that takes these latent variables and generates the final audio waveform directly. Therefore, what you are trying to plot is not a mel-spectrogram, but rather these latent variables.
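You can confirm this just by checking the tensor shape inside the forward you patched: a mel fed to the original HiFi-GAN has 80 channels (the number of mel bins), while the VITS decoder input has inter_channels channels (192 in the stock configs), so plotting it as if it were a mel yields that flat green image. A quick check (note that the Generator in the VITS repo also takes an optional speaker embedding g):

def forward(self, x, g=None):
    # Original HiFi-GAN: x is a mel spectrogram, shape [batch, 80, frames].
    # VITS: x is the masked latent z, shape [batch, inter_channels, frames] (192 by default),
    # which is why rendering it as a mel image gives nothing recognizable.
    print(x.shape)
    ...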
This approach allows VITS to maintain flexibility and high-quality synthesis without relying on traditional intermediate representations such as mel-spectrograms.
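As for getting a mel to compare models: there is no direct mapping from the latent back to a mel, so the practical route is the one you wanted to avoid, i.e. decode the latent to audio with the VITS decoder and then compute a mel from the waveform (which is cheap). Below is a minimal sketch using torchaudio; the STFT/mel settings are the usual 22.05 kHz LJSpeech values, so take the real ones from the data section of your VITS config, and double-check the attribute names (dec, z, y_mask, g) against your copy of models.py:

import torch
import torchaudio

# Assumed hyperparameters -- read the real values from the "data" section of your VITS config.
SR, N_FFT, HOP, WIN, N_MELS = 22050, 1024, 256, 1024, 80

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=N_FFT, hop_length=HOP, win_length=WIN,
    n_mels=N_MELS, f_min=0.0, f_max=None, power=1.0,
    norm="slaney", mel_scale="slaney",  # librosa-style filters, like HiFi-GAN's
)

def latent_to_mel(net_g, z, y_mask, g=None):
    # net_g: the SynthesizerTrn instance; z, y_mask, g: the tensors available inside infer().
    with torch.no_grad():
        audio = net_g.dec(z * y_mask, g=g)        # [batch, 1, samples]
    mel = mel_transform(audio.squeeze(1).cpu())   # [batch, n_mels, frames]
    return torch.log(torch.clamp(mel, min=1e-5))  # log compression, as HiFi-GAN does

Alternatively, if I remember the VITS repo correctly, mel_processing.py has a mel_spectrogram_torch function you could call on the decoded audio with the same config values, so the resulting mel matches exactly what the model was trained against.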