text-to-speech voice hidden-markov-models speech-synthesis htk

What is the purpose of speaker adaptive training and speaker dependent training?


I'm trying to create a TTS engine for Indian Accented English (not any Indian language).

I already have a database of voice recordings for Indian-accented English. What are the next steps?

I think we need to label them using files with the ".lab" extension (though I'm not really sure about this). And what are the files with the ".utts" extension for?

What is the purpose of speaker adaptive training and speaker dependent training when implementing a TTS engine using HMMs?

I googled a lot but couldn't find a detailed explanation of them (all I could find were some papers and journal articles on the topic).

It would be really helpful if you could provide me with links to resources that would guide me in creating a custom TTS engine using Hidden Markov Models.

Thank you.


Solution

  • Festival is a good speech synthesis framework: it is best known for concatenative synthesis and also supports HMM-based voices.
    HTS is another good HMM-based synthesizer.

    .lab or .phn files are label files in which each word is split into phonemes, with the corresponding timestamps from the audio. E.g., for an audio file containing the word "this", the label file could be:

    0.28 0.35 sil
    0.35 0.42 dh
    0.42 0.5 i
    0.5 0.61 s
    

    where the two numbers on each line are the start and end times, in seconds, of that phoneme's pronunciation.
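    As a rough illustration of how such a label file could be read, here is a minimal Python sketch. It assumes exactly the three-column "start end phoneme" layout shown above, with times in seconds; real label files produced by a given toolkit may use a different layout or time unit, so check your tool's documentation. The names `Segment` and `parse_lab` are made up for this example and are not part of Festival or HTS.

    ```python
    from typing import List, NamedTuple


    class Segment(NamedTuple):
        start: float  # start time in seconds (assumed unit)
        end: float    # end time in seconds (assumed unit)
        phone: str    # phoneme label, e.g. "dh"


    def parse_lab(text: str) -> List[Segment]:
        """Parse 'start end phoneme' lines into Segment records."""
        segments = []
        for line in text.splitlines():
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            start, end, phone = float(parts[0]), float(parts[1]), parts[2]
            if end < start:
                raise ValueError(f"segment ends before it starts: {line!r}")
            segments.append(Segment(start, end, phone))
        return segments


    # The example label content from the answer above.
    example = """\
    0.28 0.35 sil
    0.35 0.42 dh
    0.42 0.5 i
    0.5 0.61 s
    """

    segments = parse_lab(example)
    for seg in segments:
        # Print each phoneme with its duration in seconds.
        print(f"{seg.phone}: {seg.end - seg.start:.2f} s")
    ```

    Per-phoneme durations like these are the kind of timing information that the training stage relies on, which is why the labels need to line up with the audio.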

    .utt files are utterance files, which are produced after all information such as stress, part of speech, intonation, and duration has been taken into account. These files can then be used for speech output (playing the utterance).

    The quality of the synthesized speech depends on the audio set used for training. In speaker-dependent training, a separate model is trained for each voice, using only that speaker's recordings. In speaker-adaptive training, a model is trained on data from several speakers and then adapted to accommodate a target speaker's voice and accent/dialect, which typically requires much less data from that speaker.
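    To make the difference a little more concrete, here is a toy Python sketch. It is not how HTS performs adaptation (real systems adapt full HMM parameter sets, often with linear-transform-based methods); it only uses the standard MAP update of a single Gaussian mean as a simple stand-in. All numbers, the `tau` weight, and the function name `map_adapt_mean` are assumptions made for illustration.

    ```python
    from typing import Sequence


    def map_adapt_mean(prior_mean: float, samples: Sequence[float], tau: float) -> float:
        """MAP update of a Gaussian mean.

        prior_mean: mean of the multi-speaker ("average voice") model
        samples:    observations from the target speaker
        tau:        how strongly to trust the prior (assumed hyperparameter)
        """
        n = len(samples)
        return (tau * prior_mean + sum(samples)) / (tau + n)


    # Hypothetical values, e.g. a pitch-like feature in Hz.
    average_voice_mean = 120.0               # from many speakers
    target_samples = [150.0, 155.0, 145.0]   # a few target-speaker observations

    # Speaker-dependent estimate: uses only the target speaker's data.
    speaker_dependent_mean = sum(target_samples) / len(target_samples)

    # Speaker-adaptive estimate: starts from the average voice and moves
    # toward the target speaker as more of their data becomes available.
    adapted_mean = map_adapt_mean(average_voice_mean, target_samples, tau=3.0)

    print(speaker_dependent_mean, adapted_mean)
    ```

    With only three samples, the adapted value lands between the average-voice mean and the speaker-dependent estimate. That is the practical appeal of adaptation when the target speaker has little recorded data; with a large single-speaker database, speaker-dependent training is the more direct route.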

    You can go through the Festival manual to learn how to set up a speech synthesis pipeline. Festival is also used together with HTS: Festival handles the front-end text analysis (creating the dictionary, word-to-phoneme conversion, etc.), whereas HTS handles the HMM-based speech modelling.