vxml

TTS concatenation based on user input


Greetings, Stack Overflow community,

Is it possible to take what a user says or enters (like the digits 1-9) and, instead of having the text-to-speech engine read the numbers back to the user, play prerecorded audio clips so it sounds like our voice-over person instead of the robot?

Can you do this dynamically based on what the user inputs?

All I'm really asking for is a prod in the right direction on how to start figuring this out.


Solution

  • You can. A long time ago I wrote logic that takes the desired phrase and a list of available clips and finds the largest segments (clips often contained multiple phrases) that can be used to assemble the audio. The result tends to sound very choppy, but it is possible if you have enough prerecorded audio. In my case the content was in a niche domain, and about 95% coverage was achievable with only a couple thousand recordings.

    In the end, it was just basic search logic to find clips. If you do this at the word level, you can name each clip after the word it contains, split the input into words, and generate one audio tag per word: <audio src='the.wav'/><audio src='quick.wav'/><audio src='brown.wav'/><audio src='fox.wav'/>...
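
As a rough illustration of the search logic described above, here is a minimal server-side sketch in Python. It is not the original code. The function names (plan_clips, to_vxml), the audio/ base path, and the convention of naming multi-word clips with underscores (quick_brown.wav) are all assumptions made for this example. It greedily picks the longest recorded segment at each position and leaves any uncovered word as plain text, which a VoiceXML prompt would hand to the TTS engine.

```python
from xml.sax.saxutils import escape, quoteattr


def plan_clips(phrase, available):
    """Greedily cover the phrase with the longest recorded segments.

    `available` is a list of phrases for which a recording exists
    (hypothetical input; in practice it might come from a directory listing).
    Returns a list of ("clip", text) or ("tts", word) tuples.
    """
    words = phrase.lower().split()
    clips = {tuple(c.lower().split()) for c in available}
    longest = max((len(c) for c in clips), default=0)
    plan = []
    i = 0
    while i < len(words):
        # Try the longest candidate segment first, then shrink.
        for n in range(min(longest, len(words) - i), 0, -1):
            segment = tuple(words[i:i + n])
            if segment in clips:
                plan.append(("clip", " ".join(segment)))
                i += n
                break
        else:
            # No recording covers this word; leave it for the TTS engine.
            plan.append(("tts", words[i]))
            i += 1
    return plan


def to_vxml(plan, base_url="audio/"):
    """Render the plan as the body of a VoiceXML <prompt>."""
    parts = []
    for kind, text in plan:
        if kind == "clip":
            # Assumed naming convention: words joined by underscores.
            src = base_url + text.replace(" ", "_") + ".wav"
            parts.append("<audio src=%s/>" % quoteattr(src))
        else:
            parts.append(escape(text))
    return " ".join(parts)


plan = plan_clips("The quick brown fox", ["the", "quick", "quick brown"])
print(to_vxml(plan))
```

For digit entry, the same approach applies with one clip per digit: put spaces between the collected digits (for example "4 2 7") and pass the digit names as the available clips. The generated markup would then be written into the prompt of the dynamically served VoiceXML page.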