Tags: python, azure, python-multithreading, speaker, azure-speech

Why is transcribing Stereo Mix (speaker output) audio with Azure AI Speech Service so slow when using threads?


I have been struggling for a long time to transcribe both microphone input and speaker output audio in real time for a call center use case. The background of this project and my struggles are documented in a previous question.

Code:

import azure.cognitiveservices.speech as speechsdk
from dotenv import load_dotenv
import os
import threading

# Store credentials in .env file, initialize speech_config
load_dotenv()
audio_key = os.getenv("audio_key")
audio_region = os.getenv("audio_region")
speech_config = speechsdk.SpeechConfig(subscription=audio_key, region=audio_region)
speech_config.speech_recognition_language = "en-US"

# Endpoint strings found using aforementioned code
mic = "{0.0.1.00000000}.{6dd64d0d-e876-4f3f-b1fe-464843289599}"
stereo_mix = "{0.0.1.00000000}.{c4c4d95c-5bd1-4f09-a07e-ad3a96c381f0}"

# Initialize audio_config as shown in Azure documentation
microphone_audio_config = speechsdk.audio.AudioConfig(device_name=mic)
speaker_audio_config = speechsdk.audio.AudioConfig(device_name=stereo_mix)

# Azure Speech-to-Text Conversation Transcriber
def transcribing(evt, name):
    print(f"{name} transcribing! {evt}")

def transcribed(evt, name):
    print(f"{name} transcribed! {evt}")
 
# Function to start Azure speech recognition
def start_recognition(audio_config, speech_config, name):
    transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)
    
    transcriber.transcribed.connect(lambda evt: transcribed(evt, name))
    transcriber.transcribing.connect(lambda evt: transcribing(evt, name))

    transcriber.start_transcribing_async()

    print(f"{name} started!")

    # Infinite Loop to continue transcription
    while True:
        pass

# Individual threads for each transcriber
threading.Thread(target=start_recognition, args=(microphone_audio_config, speech_config, "Microphone",)).start()
threading.Thread(target=start_recognition, args=(speaker_audio_config, speech_config, "Speaker",)).start()

This is an altered version of my code, but the main idea and the problem are both still present.

At regular conversation speeds, the transcriber falls strangely far behind, rendering the project useless. Even at painfully slow speaking speeds, the transcription fails to keep up.

I suspect the threads are to blame, but I cannot isolate the problem to the threads, the Azure transcriber itself, or the Stereo Mix device.

Let me know if you guys have any questions, and I will definitely answer them.
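
For reference, endpoint IDs like the two above can be listed on Windows; a minimal sketch using the third-party pycaw package (pycaw is not part of the Azure SDK, and this is not the exact code from my previous question):

# Sketch: enumerate Windows audio endpoints with pycaw (pip install pycaw);
# device.id is the endpoint string AudioConfig(device_name=...) expects
from pycaw.pycaw import AudioUtilities

for device in AudioUtilities.GetAllDevices():
    print(device.FriendlyName, "->", device.id)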


Solution

  • Instead of using a while True: pass busy-wait, which pins a CPU core and can starve the Speech SDK's callback threads, keep each transcriber alive with a non-blocking wait inside an asynchronous context.

    Refactored version:

    import asyncio
    import threading
    import time

    # Azure Speech-to-Text Conversation Transcriber callbacks
    def transcribing(evt, name):
        print(f"{name} transcribing: {evt.result.text}")

    def transcribed(evt, name):
        print(f"{name} transcribed: {evt.result.text}")

    async def start_recognition(audio_config, speech_config, name, stop_event):
        transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)

        transcriber.transcribed.connect(lambda evt: transcribed(evt, name))
        transcriber.transcribing.connect(lambda evt: transcribing(evt, name))

        # start_transcribing_async() returns a ResultFuture, not a coroutine,
        # so wait for it with .get() rather than await
        transcriber.start_transcribing_async().get()
        print(f"{name} started!")

        # Non-blocking wait instead of a busy loop
        while not stop_event.is_set():
            await asyncio.sleep(0.1)

        transcriber.stop_transcribing_async().get()
        print(f"{name} stopped!")

    def run_recognition_thread(audio_config, speech_config, name, stop_event):
        # Each worker thread gets its own event loop to run the coroutine
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.run_until_complete(start_recognition(audio_config, speech_config, name, stop_event))
    
    # Event to signal the threads to stop
    stop_event = threading.Event()
    
    # Individual threads for each transcriber
    microphone_thread = threading.Thread(target=run_recognition_thread, args=(microphone_audio_config, speech_config, "Microphone", stop_event))
    speaker_thread = threading.Thread(target=run_recognition_thread, args=(speaker_audio_config, speech_config, "Speaker", stop_event))
    
    # Start threads
    microphone_thread.start()
    speaker_thread.start()
    
    try:
        while True:
            # Poll worker health once per second
            if not microphone_thread.is_alive() or not speaker_thread.is_alive():
                break
            time.sleep(1)  # time.sleep, not asyncio.sleep: no event loop runs in the main thread
    except KeyboardInterrupt:
        stop_event.set()
    
    # Join threads to ensure clean exit
    microphone_thread.join()
    speaker_thread.join()
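
    A side note on the design: the asyncio layer is optional here, since the Speech SDK delivers the transcribing/transcribed events on its own internal threads, so each worker only needs to stay alive. A minimal sketch of the same stop-event pattern with plain threads, reusing the configs and callbacks from above:

    def run_transcriber(audio_config, speech_config, name, stop_event):
        transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)
        transcriber.transcribed.connect(lambda evt: transcribed(evt, name))
        transcriber.transcribing.connect(lambda evt: transcribing(evt, name))

        transcriber.start_transcribing_async().get()
        print(f"{name} started!")

        # Event.wait() sleeps the thread until stop_event.set(); no CPU spin
        stop_event.wait()

        transcriber.stop_transcribing_async().get()
        print(f"{name} stopped!")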
    

    Alternatively, try OtterPilot to record the audio from both devices.