Tags: python, azure, python-multithreading, speaker, azure-speech

Why is transcribing Stereo Mix (speaker output) audio with Azure AI Speech Service so slow when using threads?


I have been struggling for a long time to transcribe both microphone input and speaker output audio in real time for a call center use case. The background of this project and my struggles are documented in a previous question.

Code:

import azure.cognitiveservices.speech as speechsdk
from dotenv import load_dotenv
import os
import threading

# Store credentials in .env file, initialize speech_config
load_dotenv()
audio_key = os.getenv("audio_key")
audio_region = os.getenv("audio_region")
speech_config = speechsdk.SpeechConfig(subscription=audio_key, region=audio_region)
speech_config.speech_recognition_language = "en-US"

# Endpoint strings found using aforementioned code
mic = "{0.0.1.00000000}.{6dd64d0d-e876-4f3f-b1fe-464843289599}"
stereo_mix = "{0.0.1.00000000}.{c4c4d95c-5bd1-4f09-a07e-ad3a96c381f0}"

# Initialize audio_config as shown in Azure documentation
microphone_audio_config = speechsdk.audio.AudioConfig(device_name=mic)
speaker_audio_config = speechsdk.audio.AudioConfig(device_name=stereo_mix)

# Azure Speech-to-Text Conversation Transcriber
def transcribing(evt, name):
    print(f"{name} transcribing! {evt}")

def transcribed(evt, name):
    print(f"{name} transcribed! {evt}")
 
# Function to start Azure speech recognition
def start_recognition(audio_config, speech_config, name):
    transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)
    
    transcriber.transcribed.connect(lambda evt: transcribed(evt, name))
    transcriber.transcribing.connect(lambda evt: transcribing(evt, name))

    transcriber.start_transcribing_async()

    print(f"{name} started!")

    # Infinite Loop to continue transcription
    while True:
        pass

# Individual threads for each transcriber
threading.Thread(target=start_recognition, args=(microphone_audio_config, speech_config, "Microphone",)).start()
threading.Thread(target=start_recognition, args=(speaker_audio_config, speech_config, "Speaker",)).start()

This is an altered version of my code, but the main idea and the problem are both still present.

At regular conversation speeds, the transcriber falls strangely far behind, rendering the project useless. Even at painfully slow speaking speeds, the transcription fails to keep up.

I suspect the threads are to blame, but I cannot isolate the problem to the threads, the Azure transcriber itself, or the Stereo Mix device.

Let me know if you guys have any questions, and I will definitely answer them.
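
For reference, endpoint IDs like the two above can be listed on Windows; a minimal sketch using the third-party pycaw package (pycaw is not part of the Azure SDK, and this is not the exact code from my previous question):

# Sketch: enumerate Windows audio endpoints with pycaw (pip install pycaw);
# device.id is the endpoint string AudioConfig(device_name=...) expects
from pycaw.pycaw import AudioUtilities

for device in AudioUtilities.GetAllDevices():
    print(device.FriendlyName, "->", device.id)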


Solution

  • Instead of using a while True: pass busy-wait, which pins a CPU core and can starve the Speech SDK's callback threads, keep each transcriber alive with a non-blocking wait inside an asynchronous context.

    Refactored version:

    import asyncio
    import threading
    import time

    # Azure Speech-to-Text Conversation Transcriber callbacks
    def transcribing(evt, name):
        print(f"{name} transcribing: {evt.result.text}")

    def transcribed(evt, name):
        print(f"{name} transcribed: {evt.result.text}")

    async def start_recognition(audio_config, speech_config, name, stop_event):
        transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)

        transcriber.transcribed.connect(lambda evt: transcribed(evt, name))
        transcriber.transcribing.connect(lambda evt: transcribing(evt, name))

        # start_transcribing_async() returns a ResultFuture, not a coroutine,
        # so wait for it with .get() rather than await
        transcriber.start_transcribing_async().get()
        print(f"{name} started!")

        # Non-blocking wait instead of a busy loop
        while not stop_event.is_set():
            await asyncio.sleep(0.1)

        transcriber.stop_transcribing_async().get()
        print(f"{name} stopped!")

    def run_recognition_thread(audio_config, speech_config, name, stop_event):
        # Each worker thread gets its own event loop to run the coroutine
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        loop.run_until_complete(start_recognition(audio_config, speech_config, name, stop_event))
    
    # Event to signal the threads to stop
    stop_event = threading.Event()
    
    # Individual threads for each transcriber
    microphone_thread = threading.Thread(target=run_recognition_thread, args=(microphone_audio_config, speech_config, "Microphone", stop_event))
    speaker_thread = threading.Thread(target=run_recognition_thread, args=(speaker_audio_config, speech_config, "Speaker", stop_event))
    
    # Start threads
    microphone_thread.start()
    speaker_thread.start()
    
    try:
        while True:
            # Poll worker health once per second
            if not microphone_thread.is_alive() or not speaker_thread.is_alive():
                break
            time.sleep(1)  # time.sleep, not asyncio.sleep: no event loop runs in the main thread
    except KeyboardInterrupt:
        stop_event.set()
    
    # Join threads to ensure clean exit
    microphone_thread.join()
    speaker_thread.join()
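
    A side note on the design: the asyncio layer is optional here, since the Speech SDK delivers the transcribing/transcribed events on its own internal threads, so each worker only needs to stay alive. A minimal sketch of the same stop-event pattern with plain threads, reusing the configs and callbacks from above:

    def run_transcriber(audio_config, speech_config, name, stop_event):
        transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)
        transcriber.transcribed.connect(lambda evt: transcribed(evt, name))
        transcriber.transcribing.connect(lambda evt: transcribing(evt, name))

        transcriber.start_transcribing_async().get()
        print(f"{name} started!")

        # Event.wait() sleeps the thread until stop_event.set(); no CPU spin
        stop_event.wait()

        transcriber.stop_transcribing_async().get()
        print(f"{name} stopped!")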
    

    Alternatively, try OtterPilot to record the audio from both devices.