audioalexa-skills-kitssml

How is Alexa programmed to sing?


If you say "Alexa, sing for me", she will choose one of several songs that have been created with her voice. The voice(s) for each of these songs must have been created somehow.

At first, I thought that SSML would provide the tools necessary to do this, especially the <prosody> tag which has parameters for pitch and rate (duration).

I thought perhaps each syllable of singing could have its pronunciation specified with <phoneme> and its pitch and duration specified with <prosody>, with <break> tags in between:

<speak>
  <prosody rate="20%">
    <phoneme alphabet="x-sampa" ph="U">oo</phoneme>
    <break strength="none" />
  </prosody>
  <prosody rate="20%" pitch="+50%">
    <phoneme alphabet="x-sampa" ph="U">oo</phoneme>
    <break strength="none" />
  </prosody>
  <prosody rate="20%">
    <phoneme alphabet="x-sampa" ph="U">oo</phoneme>
  </prosody>
</speak> 

However, when executed, Alexa applies her built-in inflection (to sound like a real human), and so the tone is not flat. These "ooh" sounds (above), for example, each have a falling tone. (They also have a noticeable break between phonemes even tho "no break" was explicitly specified.)

So then, how did the Alexa voice which is heard singing all of those songs get programmed? Was it via tools currently only available to Amazon developers?

It's also perplexing to me that I am apparently the only person on the internet even asking this question (based on zero results in stackoverflow, google, etc), especially this late in the game. Aren't there loads of musicians out there who would love to be able to make Alexa sing whatever they want?

Edit: Guys, I thought it was common knowledge, but there is no human voice actor behind Alexa. Her voice is completely computer-generated.


Solution

  • Alexa's voice is completely computer generated and so are the songs. Research is on-going into generating a singing synthesizer model (#1 and #2).

    Here's a video by Popgun Labs regarding how they make their AI sing. Although I am unable to find how Amazon and Google do this, my guess it will be something similar.

    EDIT: My earlier answer was based on an extension page and drew incorrect inconclusions.