voice_kal_diphone and voice_ral_diphone work correctly in singing mode: there is vocal output and the pitches are correct for the specified notes. voice_cmu_us_ahw_cg and the other CMU voices do not work correctly: there is vocal output, but the pitch is not changed according to the specified notes.
Is it possible to get correct output with the higher quality CMU voices?
The command line for working (pitch-affected) output is:
text2wave -mode singing -eval "(voice_kal_diphone)" -o song.wav song.xml
The command line for non-working (pitch-unaffected) output is:
text2wave -mode singing -eval "(voice_cmu_us_ahw_cg)" -o song.wav song.xml
Here's song.xml:
<?xml version="1.0"?>
<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN" "Singing.v0_1.dtd" []>
<SINGING BPM="60">
<PITCH NOTE="A4,C4,C4"><DURATION BEATS="0.3,0.3,0.3">nationwide</DURATION></PITCH>
<PITCH NOTE="C4"><DURATION BEATS="0.3">is</DURATION></PITCH>
<PITCH NOTE="D4"><DURATION BEATS="0.3">on</DURATION></PITCH>
<PITCH NOTE="F4"><DURATION BEATS="0.3">your</DURATION></PITCH>
<PITCH NOTE="F4"><DURATION BEATS="0.3">side</DURATION></PITCH>
</SINGING>
You may also need this patch to singing-mode.scm (the added check guards against a nil token, which would otherwise crash the call to item.name):
@@ -339,7 +339,9 @@
 (defvar singing-max-short-vowel-length 0.11)
 
 (define (singing_do_initial utt token)
-  (if (equal? (item.name token) "")
+  (if (and
+       (not (equal? nil token))
+       (equal? (item.name token) ""))
       (let ((restlen (car (item.feat token 'rest))))
         (if singing-debug
             (format t "restlen %l\n" restlen))
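If you save the hunk above as singing-mode.patch (a hypothetical file name), you can apply it with the standard patch tool from the directory that contains singing-mode.scm, typically Festival's lib directory (adjust the path for your install):
cd /path/to/festival/lib
patch singing-mode.scm < singing-mode.patch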
To set up my environment I used the festvox fest_build script. You can also download voice_cmu_us_ahw_cg separately.
It seems that the problem is in phone generation. voice_kal_diphone uses the UniSyn synthesis model, while voice_cmu_us_ahw_cg uses the ClusterGen model. The latter has its own state-based intonation and duration model instead of per-phone intonation/duration targets; you may have noticed that the durations didn't change either in the generated 'song'.
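You can confirm which waveform synthesizer a given voice selects from the Festival prompt. This is a minimal sketch, assuming the voices set the standard Synth_Method parameter as the stock UniSyn and ClusterGen voices do:
festival> (voice_kal_diphone)
festival> (Parameter.get 'Synth_Method)
; expect UniSyn
festival> (voice_cmu_us_ahw_cg)
festival> (Parameter.get 'Synth_Method)
; expect ClusterGen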
singing-mode.scm tries to extract each syllable and modify its frequency. With the ClusterGen model, however, the waveform generator simply ignores the syllable frequencies and durations set in the Target relation, because it models them differently. As a result we get better voice quality (from the statistical model), but we can't change the frequency directly.
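You can inspect the target points that the UniSyn path consumes by dumping the Target relation of a synthesized utterance. A minimal sketch from the Festival prompt (this uses the default text mode, but singing mode populates the same relation):
festival> (set! utt (utt.synth (Utterance Text "nationwide is on your side")))
festival> (utt.relation_tree utt 'Target)
With voice_kal_diphone loaded, this returns segments with f0/position target points that UniSyn interpolates through; with a ClusterGen voice the Target relation may be empty or is simply not consulted by the waveform generator, which is why the PITCH notes have no effect.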
A very good description of the generation pipeline can be found here.