I'm trying to build real-time speech recognition software that shows text on screen (with tkinter). I already found a way to capture audio from the PC output using Virtual Audio Cable, pyaudio, and the speech_recognition library. However, I feel like I'm doing something wrong: of course the recognizer isn't perfect, but sending the audio in 6-second batches leads to frequent misrecognition. Do you see a better way to do this? I could also use a different speech recognition API.
import time
import tkinter as tk
import speech_recognition as sr
from app.ui.window import Window
import wave
import pyaudiowpatch as pyaudio

if __name__ == "__main__":
    recognizer = sr.Recognizer()
    p = pyaudio.PyAudio()
    device = p.get_default_input_device_info()
    sample_rate = int(device["defaultSampleRate"])
    duration_seconds = 6  # Duration to capture
    total_frames = sample_rate * duration_seconds
    buffer_size = 1024  # Size of each read
    frames = []
    overlap_frames = int(sample_rate * 0.5)

    while True:
        try:
            with p.open(
                channels=device["maxInputChannels"],
                format=pyaudio.paInt16,
                rate=sample_rate,
                input=True,
                frames_per_buffer=buffer_size,
                input_device_index=device["index"],
            ) as stream:
                print("Capturing audio...")
                frames = [stream.read(buffer_size) for _ in range(total_frames // buffer_size)]
                combined_audio = b''.join(frames[-overlap_frames:] + frames)
                try:
                    recognized_text = recognizer.recognize_wit(
                        sr.AudioData(combined_audio, int(device["defaultSampleRate"]), 2),
                        key="API_KEY"
                    )
                except Exception as e:
                    print(e)
                    continue
                frames = []
                print(f"Recognized: {recognized_text}")
        except Exception as e:
            print(f"Error: {e}")
    p.terminate()

    # root = tk.Tk()
    # app = Window(root)
    # root.mainloop()
If you just cut the audio into 6 second pieces at random positions, you will end up with partial words in the blocks. The recognizer won't be able to handle that.
The Python speech_recognition library's microphone example uses recognizer.listen(), which detects gaps in speech by monitoring the audio volume. For continuous recording, you would use recognizer.listen_in_background(), which calls a function every time there is a gap in the speech. The gap thresholds can be adjusted as fields of the recognizer instance (see their default values). This gives near-realtime response: recognition runs as soon as you stop speaking for a moment.
If you need updates more often than that, the best approach is to run recognition on the whole audio block captured since the previous gap, and update the displayed text as new information becomes available. The next word will often change the probability estimates for prior words, so earlier output may be revised.