I'm trying to build real-time speech recognition software that shows text on screen (with tkinter). I already found a way to capture audio from the PC output using Virtual Audio Cable, pyaudio, and the speech_recognition library. However, I feel like I'm doing something wrong: of course the recognizer isn't perfect, but sending the audio in 6-second batches leads to frequent misrecognition. Do you see a better way to do this? I could also use a different speech recognition API.
import time
import tkinter as tk
import speech_recognition as sr
from app.ui.window import Window
import wave
import pyaudiowpatch as pyaudio

if __name__ == "__main__":
    recognizer = sr.Recognizer()
    p = pyaudio.PyAudio()
    device = p.get_default_input_device_info()
    sample_rate = int(device["defaultSampleRate"])
    duration_seconds = 6  # Duration to capture
    total_frames = sample_rate * duration_seconds
    buffer_size = 1024  # Size of each read
    frames = []
    overlap_frames = int(sample_rate * 0.5)

    while True:
        try:
            with p.open(
                channels=device["maxInputChannels"],
                format=pyaudio.paInt16,
                rate=sample_rate,
                input=True,
                frames_per_buffer=buffer_size,
                input_device_index=device["index"],
            ) as stream:
                print("Capturing audio...")
                frames = [stream.read(buffer_size) for _ in range(total_frames // buffer_size)]
                combined_audio = b''.join(frames[-overlap_frames:] + frames)
                try:
                    recognized_text = recognizer.recognize_wit(
                        sr.AudioData(combined_audio, int(device["defaultSampleRate"]), 2),
                        key="API_KEY"
                    )
                except Exception as e:
                    print(e)
                    continue
                frames = []
                print(f"Recognized: {recognized_text}")
        except Exception as e:
            print(f"Error: {e}")
    p.terminate()

    # root = tk.Tk()
    # app = Window(root)
    # root.mainloop()
If you just cut the audio into 6 second pieces at random positions, you will end up with partial words in the blocks. The recognizer won't be able to handle that.
The Python speech_recognition library's microphone example uses recognizer.listen(), which detects gaps in speech by monitoring the audio volume. For continuous recording, you would use recognizer.listen_in_background(), which calls a function every time there is a gap in the speech. The gap thresholds can be adjusted as fields of the recognizer instance (see their default values). This gives near-realtime response: recognition runs as soon as you stop speaking for a moment.
If you need updates more often than that, the best approach is to run recognition on the whole audio block captured since the previous gap, and update the displayed text as new information becomes available. The next word will often change the probability estimates for prior words, so earlier output may be revised.