Tags: python, azure-cognitive-services

How to save a stream object in Azure text to speech without speaking the text using Python


I want to convert a book to audio and save the file, so naturally I don't want my computer to speak the book out loud while the conversion happens. Looking at the Azure documentation, I frankly don't see a way to get a stream object without speaking the text first. I already have the code set up to save the file, but I can't save it unless that audio plays first. I want to convert text to a stream object without having to listen to my computer utter it. A very inelegant workaround is to simply mute my computer, but suppose the conversion takes an hour and I need to take a phone call during it.

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription=subscription_key,
                                       region=service_region)
speech_config.speech_synthesis_voice_name = 'ar-EG-SalmaNeural'
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

# I also tried creating the synthesizer with no audio config:
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

I don't want to do the following step, because it utters the audio:

result = speech_synthesizer.speak_text_async("I'm excited to try text to speech").get()

But I have to do that step in order to run the following:

stream = speechsdk.AudioDataStream(result)
stream.save_to_wav_file(path)


I've tried looking at all the methods on the speech_synthesizer object, but they all seem to involve speaking the text. They are listed here:

class SpeechSynthesizer(builtins.object)
 |  SpeechSynthesizer(speech_config: azure.cognitiveservices.speech.SpeechConfig, audio_config: Optional[azure.cognitiveservices.speech.audio.AudioOutputConfig] = <azure.cognitiveservices.speech.audio.AudioOutputConfig object at 0x137ffc790>, auto_detect_source_language_config: azure.cognitiveservices.speech.languageconfig.AutoDetectSourceLanguageConfig = None)
 |  
 |  A speech synthesizer.
 |  
 |  :param speech_config: The config for the speech synthesizer
 |  :param audio_config: The config for the audio output.
 |      This parameter is optional.
 |      If it is not provided, the default speaker device will be used for audio output.
 |      If it is None, the output audio will be dropped.
 |      None can be used for scenarios like performance test.
 |  :param auto_detect_source_language_config: The auto detection source language config
 |  
 |  Methods defined here:
 |  
 |  __del__(self)
 |  

 |  __init__(self, speech_config: azure.cognitiveservices.speech.SpeechConfig, audio_config: Optional[azure.cognitiveservices.speech.audio.AudioOutputConfig] = <azure.cognitiveservices.speech.audio.AudioOutputConfig object at 0x137ffc790>, auto_detect_source_language_config: azure.cognitiveservices.speech.languageconfig.AutoDetectSourceLanguageConfig = None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  get_voices_async(self, locale: str = '') -> azure.cognitiveservices.speech.ResultFuture
 |      Get the available voices, asynchronously.
 |      
 |      :param locale: Specify the locale of voices, in BCP-47 format; or leave it empty to get all available voices.
 |      :returns: A task representing the asynchronous operation that gets the voices.
 |  
 |  speak_ssml(self, ssml: str) -> azure.cognitiveservices.speech.SpeechSynthesisResult
 |      Performs synthesis on ssml in a blocking (synchronous) mode.
 |      
 |      :returns: A SpeechSynthesisResult.
 |  
 |  speak_ssml_async(self, ssml: str) -> azure.cognitiveservices.speech.ResultFuture
 |      Performs synthesis on ssml in a non-blocking (asynchronous) mode.
 |      
 |      :returns: A future with SpeechSynthesisResult.
 |  
 |  speak_text(self, text: str) -> azure.cognitiveservices.speech.SpeechSynthesisResult
 |      Performs synthesis on plain text in a blocking (synchronous) mode.
 |      
 |      :returns: A SpeechSynthesisResult.
 |  
 |  speak_text_async(self, text: str) -> azure.cognitiveservices.speech.ResultFuture
 |      Performs synthesis on plain text in a non-blocking (asynchronous) mode.
 |      
 |      :returns: A future with SpeechSynthesisResult.
 |  
 |  start_speaking_ssml(self, ssml: str) -> azure.cognitiveservices.speech.SpeechSynthesisResult
 |      Starts synthesis on ssml in a blocking (synchronous) mode.
 |      
 |      :returns: A SpeechSynthesisResult.
 |  
 |  start_speaking_ssml_async(self, ssml: str) -> azure.cognitiveservices.speech.ResultFuture
 |      Starts synthesis on ssml in a non-blocking (asynchronous) mode.
 |      
 |      :returns: A future with SpeechSynthesisResult.
 |  
 |  start_speaking_text(self, text: str) -> azure.cognitiveservices.speech.SpeechSynthesisResult
 |      Starts synthesis on plain text in a blocking (synchronous) mode.
 |      
 |      :returns: A SpeechSynthesisResult.
 |  
 |  start_speaking_text_async(self, text: str) -> azure.cognitiveservices.speech.ResultFuture
 |      Starts synthesis on plain text in a non-blocking (asynchronous) mode.
 |      
 |      :returns: A future with SpeechSynthesisResult.
 |  
 |  stop_speaking(self) -> None
 |      Synchronously terminates ongoing synthesis operation.
 |      This method will stop playback and clear unread data in PullAudioOutputStream.
 |  
 |  stop_speaking_async(self) -> azure.cognitiveservices.speech.ResultFuture
 |      Asynchronously terminates ongoing synthesis operation.
 |      This method will stop playback and clear unread data in PullAudioOutputStream.
 |      
 |      :returns: A future that is fulfilled once synthesis has been stopped.
 |  
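Rereading the docstring above, audio_config=None is documented to drop the output audio, so presumably something like the following would synthesize without anything being played, with the bytes still reachable through the result. I haven't verified this; subscription_key, service_region, and path are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=service_region)
speech_config.speech_synthesis_voice_name = 'ar-EG-SalmaNeural'

# audio_config=None: per the docstring, "the output audio will be dropped",
# i.e. nothing is routed to the speaker.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
result = synthesizer.speak_text_async("Some text from the book").get()

# The synthesized audio should still exist on the result and be savable.
stream = speechsdk.AudioDataStream(result)
stream.save_to_wav_file(path)
```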
UPDATE

Someone recommended using a synthesize_speech_to_stream_async method, but their code resulted in errors and I haven't heard back from them. Still, I think they might be on to something.

Their code was:

speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=service_region)
speech_config.speech_synthesis_voice_name = 'ar-EG-SalmaNeural'
stream = speechsdk.AudioDataStream(format=speechsdk.AudioStreamFormat(pcm_data_format=speechsdk.PcmDataFormat.Pcm16Bit, sample_rate_hertz=16000, channel_count=1))
result = speechsdk.SpeechSynthesizer(speech_config=speech_config).synthesize_speech_to_stream_async("I'm excited to try text to speech", stream).get()
stream.save_to_wav_file(path)

This part generated an error:

stream = speechsdk.AudioDataStream(
            format=speechsdk.AudioStreamFormat(
                pcm_data_format=speechsdk.PcmDataFormat.Pcm16Bit,
                sample_rate_hertz=16000, channel_count=1))

The error message recommended:

    stream = speechsdk.AudioDataStream(
        format=speechsdk.AudioStreamWaveFormat(
            pcm_data_format=speechsdk.PcmDataFormat.Pcm16Bit,
            sample_rate_hertz=16000, channel_count=1))

But that generated:

AttributeError: module 'azure.cognitiveservices.speech' has no attribute 'PcmDataFormat'

Solution

  • I tried the following code to save synthesized audio to a .wav file with Azure Text to Speech in Python, without speaking the text aloud.

    Code :

    import azure.cognitiveservices.speech as speechsdk
    import tempfile
    
    subscription_key = '<speech_key>'
    service_region = '<speech_region>'
    
    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=service_region)
    speech_config.speech_synthesis_voice_name = 'ar-EG-SalmaNeural'
    
    # Route synthesis to a temporary file instead of the default speaker,
    # so nothing is played aloud.
    temp_file_path = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    audio_config = speechsdk.audio.AudioOutputConfig(filename=temp_file_path)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    
    text_to_speak = "Hi Kamali! I am happy to see you."
    result = speech_synthesizer.speak_text_async(text_to_speak).get()
    
    # The synthesized bytes are also available on the result object,
    # so they can be written to any path we like.
    file_path = 'output.wav'
    with open(file_path, 'wb') as audio_file:
        audio_file.write(result.audio_data)
    
    print(f"Audio saved to {file_path}")
    

    Output :

    The program ran successfully, converting the text to speech and saving it as a .wav file without any spoken output.

    C:\Users\xxxxx\Documents\xxxxx>python sample.py
    Audio saved to output.wav
    

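    To confirm the saved file without listening to it, the WAV header can be inspected with Python's standard wave module. The sketch below is self-contained: since producing output.wav requires real Azure credentials, it writes a tiny synthetic 16 kHz mono 16-bit file as a stand-in, then reads its parameters back:

    ```python
    import wave

    def wav_params(path):
        """Return (channels, sample_width_bytes, frame_rate, n_frames) of a WAV file."""
        with wave.open(path, 'rb') as wf:
            return (wf.getnchannels(), wf.getsampwidth(), wf.getframerate(), wf.getnframes())

    # Stand-in for the Azure output: one second of 16 kHz, mono, 16-bit silence.
    with wave.open('sample.wav', 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)       # 2 bytes per sample = 16-bit
        wf.setframerate(16000)
        wf.writeframes(b'\x00\x00' * 16000)

    print(wav_params('sample.wav'))  # → (1, 2, 16000, 16000)
    ```

    Running the same wav_params check on the real output.wav is a quick way to verify the voice's output format without ever playing the audio.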