I am generating long-form content with AWS Polly using the AWS SDK for JavaScript. My content is over 3,000 characters long, so I am using the long-form engine and saving the generated files in an S3 bucket (as required). Each audio file is under 4 MB in size (around 10 minutes of audio, at most).
I am able to generate the Polly files and save them, and I can see the contents of the S3 bucket, but I am having no luck retrieving or playing these files. What am I missing?
My project is a React/Node/TypeScript web application. Right now I am running locally in a Docker container while I develop this feature.
I should note that I am new to AWS, so there may be basics that I'm missing.
I would like to either stream the content from Polly as it is being generated, or at least stream it from S3 after generation has completed.
First I tried SynthesizeSpeechCommand. The SynthesizeSpeechCommandOutput response contained an AudioStream, which offered a function called transformToWebStream() ... but neither the AudioStream nor the object returned from transformToWebStream() worked the way I would expect a readable stream to work (based on my experience with Node file handling and streaming).
import { SynthesizeSpeechCommandOutput } from '@aws-sdk/client-polly';

const playNarration = async () => {
  const stream: SynthesizeSpeechCommandOutput | undefined = await getAudiostream();
  if (stream) {
    console.log(stream);
    const webStream: ReadableStream | undefined = stream.AudioStream?.transformToWebStream();
    console.log(webStream);
    if (webStream) {
      webStream.on('data', (chunk: any) => { // THIS ERRORS, SAYS 'ON' IS NOT A FUNCTION
        console.log(chunk);
      });
    }
  }
};
I also tried using a StartSpeechSynthesisTaskCommand, grabbing the OutputUri from the returned SynthesisTask and sending that to an audio player (https://www.npmjs.com/package/react-h5-audio-player).
static async getAudiostream(article: IArticle): Promise<string | undefined> {
  let streamUrl = '';
  if (NarrationProvider.pollyClient) {
    const bodyString = documentToPlainTextString(article.body);
    const narrationParams = {
      Engine: Engine.LONG_FORM,
      LanguageCode: LanguageCode.en_US,
      OutputFormat: OutputFormat.MP3,
      Text: bodyString,
      TextType: TextType.TEXT,
      VoiceId: VoiceId.Danielle,
      OutputS3BucketName: NarrationProvider.s3Bucket,
      OutputS3KeyPrefix: article.slug,
    };
    const command = new StartSpeechSynthesisTaskCommand(narrationParams);
    const result = await NarrationProvider.pollyClient
      .send(command)
      .catch((error) => {
        throw error;
      });
    if (result?.SynthesisTask?.OutputUri) {
      streamUrl = result.SynthesisTask.OutputUri;
    }
  }
  return streamUrl || undefined;
}
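On the client, the URL from that function goes straight into the player. The wiring is roughly this (simplified; narrationUrl is a variable holding the returned OutputUri):

import AudioPlayer from 'react-h5-audio-player';
import 'react-h5-audio-player/lib/styles.css';

// narrationUrl holds the OutputUri returned by getAudiostream()
<AudioPlayer src={narrationUrl} />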
Just for the hell of it, I also tried manually generating a presigned URL for an S3 file and sending that to the audio player, and that didn't work either.
I can't possibly be the only person wanting to put AI voices in their application, but I am not seeing any useful or recent answers here on Stack Overflow.
ReadableStream, as implemented in browser land, is a different API from the streams originally implemented in Node.js, so there is no on() method. You can see the documentation here: https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream
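If you want the equivalent of your on('data') loop, a web ReadableStream is consumed through a reader instead of events. A minimal sketch, assuming webStream is the object returned by transformToWebStream():

const reader = webStream.getReader();
for (;;) {
  const { done, value } = await reader.read(); // value is a Uint8Array chunk
  if (done) break; // stream exhausted
  console.log(value);
}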
Normally when using Polly this way, you can decode the buffer you get and play it back right away with an AudioContext.
import { SynthesizeSpeechCommand, Engine, LanguageCode, OutputFormat, VoiceId } from '@aws-sdk/client-polly';

// Generate Speech
const pollyRes = await pollyClient.send(
  new SynthesizeSpeechCommand({
    Engine: Engine.LONG_FORM,
    LanguageCode: LanguageCode.en_US,
    OutputFormat: OutputFormat.MP3,
    Text: bodyString,
    VoiceId: VoiceId.Danielle,
  })
);

// Play Speech
const audioContext = new AudioContext();
const pollyBufferSourceNode = audioContext.createBufferSource();
pollyBufferSourceNode.buffer = await audioContext.decodeAudioData(
  (await pollyRes.AudioStream.transformToByteArray()).buffer
);
pollyBufferSourceNode.connect(audioContext.destination);
pollyBufferSourceNode.start();
As for the S3 output, I haven't used that method, but yes, I would fully expect you could sign a GET URL and do something like this:
const a = new Audio(url);
a.play(); // Must be done on user click or some other interactive event
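Signing the GET URL on the server would look roughly like this (a sketch; the bucket name, key, and region are placeholders, and the key should match the one shown in the task's OutputUri):

import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

const s3 = new S3Client({ region: 'us-east-1' }); // placeholder region
const url = await getSignedUrl(
  s3,
  new GetObjectCommand({
    Bucket: 'my-narration-bucket', // placeholder bucket
    Key: 'article-slug.task-id.mp3', // placeholder key; use the key from the task's OutputUri
  }),
  { expiresIn: 3600 } // URL valid for one hour
);

Note that the OutputUri on the SynthesisTask is a plain, unsigned S3 URL, so handing it directly to an audio player will fail unless the object is publicly readable; a presigned URL like the one above is what the player needs.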