Some background: In trying to build a unit selection voice I followed the steps here: https://github.com/CSTR-Edinburgh/CSTR-Edinburgh.github.io/blob/master/_posts/2016-8-21-Multisyn_unit_selection.md and used a voice definition from here: https://raw.githubusercontent.com/CSTR-Edinburgh/merlin/master/egs/hybrid_synthesis/s1/voice_definition_files/unit_selection/cstr_us_awb_arctic_multisyn.scm. Unfortunately, the wavs were too noisy so I ended up hand-labelling them and skipping the automatic labelling process.
The voice is OK now but still needs some work. One error that occurs constantly is that Festival reports "Missing diphone" for any pause-to-phone transition, e.g.:
festival> (utt.relation.print (SayText "I can say anything I want.") 'Unit)
Missing diphone: #_ay
diphone still missing, backing off: #_ay
backed off: #_ay -> #_ax
diphone still missing, backing off: #_ax
backed off: #_ay -> #_#
diphone still missing, backing off: #_#
backed off: #_ay ->
Missing diphone: ey_eh
Interword so inserting silence.
diphone still missing, backing off: ey_#
backed off: ey_eh -> ax_#
diphone still missing, backing off: ax_#
backed off: ey_eh -> #_#
diphone still missing, backing off: #_#
backed off: ey_eh ->
Missing diphone: #_eh
diphone still missing, backing off: #_eh
backed off: #_eh -> #_ax
diphone still missing, backing off: #_ax
backed off: #_eh -> #_#
diphone still missing, backing off: #_#
backed off: #_eh ->
Missing diphone: t_#
diphone still missing, backing off: t_#
backed off: t_# -> #_#
diphone still missing, backing off: #_#
backed off: t_# ->
I tried replacing sil and sp (from the automatic process) in the labels with pau and h# (in order to correspond with the silences used in festival/lib/radio_phones.scm), and I also tried replacing them with just #, but this didn't change anything. The source wav/lab files definitely contain the transitions above (e.g. several utterances start with "I can"), but Festival never seems to use them.
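For reference, the label rewriting I tried was roughly along these lines (a minimal sketch, assuming xwaves-style .lab files where label lines are "end_time colour label"; the lab/ path and the exact mapping are just examples from my setup):

import glob

# Example mapping from the silence labels left by the automatic process
# to the single pause symbol; adjust to whatever your voice expects.
SILENCE_MAP = {"sil": "#", "sp": "#"}

for lab_path in glob.glob("lab/*.lab"):  # example location of the label files
    out_lines = []
    with open(lab_path) as f:
        for line in f:
            fields = line.split()
            # Only rewrite lines that look like "<end_time> <colour> <label>"
            # with a label in the map; header lines and the lone "#" separator
            # are copied through unchanged.
            if len(fields) == 3 and fields[2] in SILENCE_MAP:
                fields[2] = SILENCE_MAP[fields[2]]
                out_lines.append(" ".join(fields) + "\n")
            else:
                out_lines.append(line)
    with open(lab_path, "w") as f:
        f.writelines(out_lines)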
How can I get Festival to use the pause-to-phone transitions in the source data?
Thanks!
What was happening was that, when I ran a script based on the Multisyn unit selection recipe, the build_utts step was failing and skipping utterances because the hand-labelled labels didn't match exactly what Festival would have predicted. For example, if the speaker had said "extreme" as eh k s ... but Festival calculated ih k s ..., the build_utts script would fail with an error like:
align missmatch at ih (0.000000) eh (2.810566)
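If it helps anyone, this is roughly how I located the mismatches (a minimal sketch, assuming you have redirected the build output into a logs/ directory with one .log file per utterance, which is just how I happened to capture it; the regex is based on the error line above):

import glob
import re

# Error lines look like: "align missmatch at ih (0.000000) eh (2.810566)"
# where the first phone appears to be Festival's prediction and the second
# the hand label, with its end time in seconds.
MISMATCH = re.compile(r"align missmatch at (\S+) \(([\d.]+)\) (\S+) \(([\d.]+)\)")

for log_path in sorted(glob.glob("logs/*.log")):  # hypothetical log layout
    with open(log_path) as f:
        for line in f:
            m = MISMATCH.search(line)
            if m:
                predicted, _, labelled, end_time = m.groups()
                print(f"{log_path}: predicted '{predicted}' vs labelled "
                      f"'{labelled}' near {float(end_time):.2f}s")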
I manually ran the build_utts script for each utterance and adjusted the labels accordingly. If, like me, you are foolish enough to try hand-labelling yourself, a couple of tips that helped me: watch out for closure labels like t_cl or d_cl, as these can really mess it up when it's trying to match; and make sure there is a pause (i.e. #) at the start and end of each utterance, as the build_utts script won't complain about it but when running the voice in Festival you will get an error like:
-=-=-=-=-=- EST Error -=-=-=-=-=-
{FND} Feature end not defined
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
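A quick way to catch that up front is to check that the first and last label in every file is the pause symbol (again a minimal sketch, with the same assumptions about the .lab format and paths as above):

import glob

def phone_labels(lab_path):
    # Skip the xwaves-style header, which ends with a line containing only "#",
    # then collect the third field (the phone label) from each remaining line.
    phones = []
    in_header = True
    with open(lab_path) as f:
        for line in f:
            stripped = line.strip()
            if in_header:
                if stripped == "#":
                    in_header = False
                continue
            fields = stripped.split()
            if len(fields) >= 3:
                phones.append(fields[2])
    return phones

for lab_path in sorted(glob.glob("lab/*.lab")):  # example path
    phones = phone_labels(lab_path)
    if not phones or phones[0] != "#" or phones[-1] != "#":
        print(f"{lab_path}: missing pause (#) at the start or end")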
Thanks to @NikolayShmyrev for pointing me in the right direction. He also recommended using Ossian instead of Festival, since Ossian uses Python rather than Festival's fairly difficult code.