
Google Cloud Text-to-Speech Interface Confusion (How do I download the mp3 files?)


I'd like to preface this with the fact that I am not a programmer/developer - I am a multimedia designer. I use text-to-speech to generate placeholder audio files that can be used to time animations before we record the official audio narration.

Previously I was using Amazon Polly but I wanted to give Google Cloud a try. However, I'm having the hardest time actually figuring out how to generate the mp3 files and save them.

With Amazon Polly, you simply go to a website, enter your text into a field, and click a button, and it saves your audio as an mp3 file. With Google Cloud, it seems far more complicated than that. The "quick start" guide has me enabling APIs, downloading JSON files, setting environment credentials, initializing SDKs, and entering code into a command prompt.

Every single one of the guides I've read on their documentation page inevitably leads me to a step that I simply don't understand. I hate to sound like a complete buffoon, but this seems to be a bit over my head. I'm not looking to create software or integrate machine learning into a website; I just want to enter a few lines of text and generate an mp3 file.

Is there any way to do that with Google Cloud? The launch page (https://cloud.google.com/text-to-speech/) offers exactly what I want, but there is no option to download the files, just preview them.

Thanks in advance for any help you can provide to this newbie.


Solution

  • All of Google's ML-related tools have a pretty poor user experience for the general user and are designed very specifically for programmatic usage. If you're just looking for some basic tools that are reasonably nice to use, GCP probably isn't it at the moment.

    Given that, the samples aren't that difficult to turn into something more useful if you're willing to struggle a little at the beginning. I'd suggest using the command line described here.

    I'm going to add some initial steps:

    1) Download and set up the gcloud SDK tools.
    2) In a terminal, run gcloud auth application-default login. This will open a browser; log in like you would to the GCP Console.
    3) They provide a sample request to generate a file:

    curl -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
      -H "Content-Type: application/json; charset=utf-8" \
      --data "{
        'input':{
          'text':'Android is a mobile operating system developed by Google,
             based on the Linux kernel and designed primarily for
             touchscreen mobile devices such as smartphones and tablets.'
        },
        'voice':{
          'languageCode':'en-gb',
          'name':'en-GB-Standard-A',
          'ssmlGender':'FEMALE'
        },
        'audioConfig':{
          'audioEncoding':'MP3'
        }
      }" "https://texttospeech.googleapis.com/v1/text:synthesize" > synthesize-text.txt
    

    This is what I meant about the poor experience. The last part of the command, "https://texttospeech.googleapis.com/v1/text:synthesize" > synthesize-text.txt, writes the result of the text-to-speech request to synthesize-text.txt, and your MP3 is inside that text file. But because Google expects you to use the API programmatically, the MP3 isn't returned as a plain file. Instead it comes back in an encoding called Base64, which makes it easier to send binary data over HTTP (where text is most common). So instead of an MP3 you get a JSON file, like:

    { "audioContent": "//NExAASCCIIAAhEAGAAEMW4kAYPnwwIKw/BBTpwTvB+IAxIfghUfW.." }

    The text starting with // IS your audio. Because you're doing this manually, you need to copy everything inside the quotes (it will be a really long string of characters starting with //; keep the // characters) into a new file. You can name it whatever you want; the guide names it synthesize-output-base64.txt. Then run base64 synthesize-output-base64.txt --decode > synthesized-audio.mp3 to turn it into an MP3.
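
    If copying that long string by hand gets tedious, the copy-and-decode step can probably be scripted. The following is only a rough sketch, not something from Google's guide: extract_audio is a hypothetical helper name, and it assumes the response looks like the JSON above (a single "audioContent" field whose value has no line breaks) and that your base64 command accepts --decode, as in the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Google's sample): pull the Base64 string
# out of the JSON response and decode it into an audio file.
# Usage: extract_audio RESPONSE_FILE OUTPUT_FILE
extract_audio() {
  response_file=$1
  output_file=$2

  # Print only the text between the quotes that follow "audioContent",
  # then decode that Base64 text into binary audio.
  # Assumption: the whole value sits on one line, as in the sample response.
  sed -n 's/.*"audioContent"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$response_file" \
    | base64 --decode > "$output_file"
}

# Example call, assuming the curl command wrote synthesize-text.txt:
# extract_audio synthesize-text.txt synthesized-audio.mp3
```

    With something like this you would run the curl request once, then call the helper on the file it wrote, with no manual copying in between.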

    And you're done. The original request lets you specify the text, voice, and so on. But realistically, if you're looking for casual text-to-speech with a pretty UI, GCP isn't there yet.