I have been struggling for a long time to transcribe both microphone input and speaker output audio in real time, at the same time, for a call-center use case. The background of this project and my struggles are documented in a previous question.
Code:
import azure.cognitiveservices.speech as speechsdk
from dotenv import load_dotenv
import os
import threading
# Store credentials in .env file, initialize speech_config
load_dotenv()
audio_key = os.getenv("audio_key")
audio_region = os.getenv("audio_region")
speech_config = speechsdk.SpeechConfig(subscription=audio_key, region=audio_region)
speech_config.speech_recognition_language = "en-US"
# Endpoint strings found using aforementioned code
mic = "{0.0.1.00000000}.{6dd64d0d-e876-4f3f-b1fe-464843289599}"
stereo_mix = "{0.0.1.00000000}.{c4c4d95c-5bd1-4f09-a07e-ad3a96c381f0}"
# Initialize audio_config as shown in Azure documentation
microphone_audio_config = speechsdk.audio.AudioConfig(device_name=mic)
speaker_audio_config = speechsdk.audio.AudioConfig(device_name=stereo_mix)
# Azure Speech-to-Text Conversation Transcriber event handlers
def transcribing(evt, name):
    print(f"{name} transcribing! {evt}")

def transcribed(evt, name):
    print(f"{name} transcribed! {evt}")

# Function to start Azure speech recognition
def start_recognition(audio_config, speech_config, name):
    transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)
    transcriber.transcribed.connect(lambda evt: transcribed(evt, name))
    transcriber.transcribing.connect(lambda evt: transcribing(evt, name))
    transcriber.start_transcribing_async()
    print(f"{name} started!")
    # Infinite loop to keep the thread alive during transcription
    while True:
        pass
# Individual threads for each transcriber
threading.Thread(target=start_recognition, args=(microphone_audio_config, speech_config, "Microphone",)).start()
threading.Thread(target=start_recognition, args=(speaker_audio_config, speech_config, "Speaker",)).start()
This is a simplified version of my code, but the main idea and my problem are both still present.
At regular conversation speeds, the transcriber falls strangely far behind, and the project is rendered useless. Even at painfully slow talking speeds, the transcription fails.
I suspect the threads are to blame, but I cannot isolate the problem to the threads, the Azure transcriber itself, or the Stereo Mix device.
Let me know if you guys have any questions, and I will definitely answer them.
Instead of using a while True loop, which busy-waits and causes high CPU usage, make sure the waiting happens in a non-blocking way within an asynchronous context.
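To see why the busy loop hurts: `while True: pass` pins a core at 100%, starving the SDK's own worker threads, whereas a blocking wait yields the CPU. A minimal stdlib sketch of the difference, independent of the Azure SDK (the `worker` function here is purely illustrative):

```python
import threading
import time

stop_event = threading.Event()

def worker(name, stop_event):
    # Event.wait() blocks without consuming CPU, unlike `while True: pass`
    while not stop_event.wait(timeout=0.1):
        pass  # periodic housekeeping could go here
    print(f"{name} stopped cleanly")

t = threading.Thread(target=worker, args=("Worker", stop_event))
t.start()
time.sleep(0.3)   # let the worker idle briefly
stop_event.set()  # signal it to stop
t.join()
```

The same idea carries over to asyncio: `await asyncio.sleep(...)` yields control instead of spinning.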
Refactored version:
import asyncio
import time

# Azure Speech-to-Text Conversation Transcriber event handlers
def transcribing(evt, name):
    print(f"{name} transcribing: {evt.result.text}")

def transcribed(evt, name):
    print(f"{name} transcribed: {evt.result.text}")

async def start_recognition(audio_config, speech_config, name, stop_event):
    transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)
    transcriber.transcribed.connect(lambda evt: transcribed(evt, name))
    transcriber.transcribing.connect(lambda evt: transcribing(evt, name))
    # start_transcribing_async() returns the SDK's own ResultFuture, which is
    # not awaitable; call .get() to wait for the session to actually start
    transcriber.start_transcribing_async().get()
    print(f"{name} started!")
    while not stop_event.is_set():
        await asyncio.sleep(0.1)  # Non-blocking wait
    transcriber.stop_transcribing_async().get()
    print(f"{name} stopped!")

def run_recognition_thread(audio_config, speech_config, name, stop_event):
    # Each thread gets its own asyncio event loop
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(start_recognition(audio_config, speech_config, name, stop_event))
    loop.close()

# Event to signal the threads to stop
stop_event = threading.Event()

# Individual threads for each transcriber
microphone_thread = threading.Thread(target=run_recognition_thread, args=(microphone_audio_config, speech_config, "Microphone", stop_event))
speaker_thread = threading.Thread(target=run_recognition_thread, args=(speaker_audio_config, speech_config, "Speaker", stop_event))

# Start threads
microphone_thread.start()
speaker_thread.start()

try:
    # Main thread waits until either worker thread exits; note that
    # asyncio.sleep() does nothing outside an event loop, so use time.sleep()
    while microphone_thread.is_alive() and speaker_thread.is_alive():
        time.sleep(1)
except KeyboardInterrupt:
    stop_event.set()

# Join threads to ensure a clean exit
microphone_thread.join()
speaker_thread.join()
This version uses non-blocking waits inside the event loops and exits the main loop if either thread has stopped. Alternatively, you could try OtterPilot to record the audio from both devices.
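Stripped of the Azure specifics, the one-event-loop-per-thread pattern above can be sketched with a hypothetical worker that just counts ticks until the stop event is set:

```python
import asyncio
import threading
import time

def run_in_thread(name, stop_event, results):
    # Each thread owns its own event loop, mirroring run_recognition_thread
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

    async def work():
        ticks = 0
        while not stop_event.is_set():
            await asyncio.sleep(0.05)  # non-blocking wait inside the loop
            ticks += 1
        results[name] = ticks

    loop.run_until_complete(work())
    loop.close()

stop_event = threading.Event()
results = {}
threads = [threading.Thread(target=run_in_thread, args=(name, stop_event, results))
           for name in ("mic", "speaker")]
for t in threads:
    t.start()
time.sleep(0.3)   # let both workers run briefly
stop_event.set()  # signal both threads to stop
for t in threads:
    t.join()
print(sorted(results))
```

Both workers run concurrently and shut down cleanly from a single `threading.Event`, which is exactly what the transcriber threads need.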