python, pytorch, signal-processing, speech-to-text, torchaudio

Real-time speech recognition with CTC decoder


I am trying to implement real-time ASR with a CTC decoder. I refer to the following torchaudio example on how to use the CTC decoder. I use PyAudio to listen to the microphone, whose output is a byte string. Every chunk of the signal goes through a loop with a series of transformations until I get a tensor in the correct format.

import pyaudio
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M
acoustic_model = bundle.get_model()

mic = pyaudio.PyAudio()
stream = mic.open(format=pyaudio.paFloat32,
                  channels=1,
                  rate=16000,
                  input=True,
                  frames_per_buffer=4096)

data_t = torch.empty(0)
for i in range(0, number_of_chunks):  # number_of_chunks sets how long we record
    data_bytes = stream.read(4096)
    data_tensor = torch.frombuffer(data_bytes,
                                   dtype=torch.float32,
                                   count=-1,
                                   offset=0,
                                   requires_grad=False)
    data_t = torch.cat((data_t, data_tensor))  # append the new chunk to the signal
data_t2d = torch.stack((data_t, data_t))  # for some reason torchaudio wants it like this

emission, _ = acoustic_model(data_t2d)

# beam_search_decoder is built following the torchaudio CTC decoder tutorial
beam_search_result = beam_search_decoder(emission)
beam_search_transcript = " ".join(beam_search_result[0][0].words).strip()

Before going to the CTC decoder, the signal (which I called data_t2d) is a tensor of shape torch.Size([2, 486560]). It is passed to the acoustic model, which transforms it into a tensor of shape torch.Size([2, 1520, 29]) - which I called emission. This is the most time-consuming operation in the process. I am looking for a way to pass the signal through the acoustic model chunk by chunk, so the transformation can begin as soon as the microphone is active. If I parallelise the steps in the loop and make the acoustic model part of the loop, so that it transforms chunks which are concatenated at the end, some of the signal is lost. Is there a way to implement this process efficiently and without losses? I referred to this example, but the pipeline used there seems to have different characteristics and I am struggling to adapt that code to work with the CTC decoder.


Solution

  • Thanks for the supplementary information you provided. Here are some things that might help (I hope).

    About livestreaming microphone data

    It might be easier to manage this with a callback function (see the PyAudio docs) instead of reading sequentially from the microphone. That way, the callback is called each time data arrives from the microphone, and in your case it would pass this audio through the acoustic model and beam search decoder, possibly doing so in a subprocess.
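
    A minimal sketch of the callback approach, assuming the heavy work is done by a consumer reading from a queue (the queue and the consumer loop are my own additions, not code from the torchaudio tutorial):

    import queue

    import pyaudio

    audio_queue = queue.Queue()

    def mic_callback(in_data, frame_count, time_info, status):
        # PyAudio calls this from its own thread whenever a buffer is ready;
        # keep it cheap and just hand the raw bytes off to the consumer.
        audio_queue.put(in_data)
        return (None, pyaudio.paContinue)

    mic = pyaudio.PyAudio()
    stream = mic.open(format=pyaudio.paFloat32,
                      channels=1,
                      rate=16000,
                      input=True,
                      frames_per_buffer=4096,
                      stream_callback=mic_callback)

    # The stream starts capturing immediately; meanwhile another thread or
    # subprocess reads audio_queue and runs the acoustic model + decoder.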

    Another way to do the same thing would be to launch two processes at the beginning of the program: one reads input from the microphone and puts it in a multiprocessing queue, while the second reads from that same queue and, whenever a new input arrives, puts it through the acoustic model and decoder, and then you do whatever you want with the output.
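
    A rough sketch of that two-process layout (the function names and chunking are placeholders, and the decoding step is only indicated with a comment):

    import multiprocessing as mp

    import pyaudio
    import torch
    import torchaudio

    def mic_reader(q):
        mic = pyaudio.PyAudio()
        stream = mic.open(format=pyaudio.paFloat32, channels=1, rate=16000,
                          input=True, frames_per_buffer=4096)
        while True:
            q.put(stream.read(4096))

    def transcriber(q):
        bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M
        acoustic_model = bundle.get_model()
        while True:
            data_bytes = q.get()
            waveform = torch.frombuffer(data_bytes, dtype=torch.float32).unsqueeze(0)
            with torch.inference_mode():
                emission, _ = acoustic_model(waveform)
            # run beam_search_decoder(emission) here and publish the transcript

    if __name__ == "__main__":
        q = mp.Queue()
        reader = mp.Process(target=mic_reader, args=(q,))
        worker = mp.Process(target=transcriber, args=(q,))
        reader.start()
        worker.start()
        reader.join()
        worker.join()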

    About accelerating the acoustic model step

    First, about the line data_t2d = torch.stack((data_t, data_t)): I think what torchaudio really wants is an input of shape [B, T], where B is a batch dimension and T a temporal dimension. Since in your case B would be 1, you can do data_t2d = data_t.unsqueeze(0) instead, giving the acoustic model less work (currently it processes the data twice, once for each copy in the stack, which makes it slower).
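
    Concretely, reusing acoustic_model and data_t from your code:

    data_t2d = data_t.unsqueeze(0)          # shape [1, T] instead of [2, T]
    emission, _ = acoustic_model(data_t2d)  # emission now has shape [1, num_frames, 29]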

    Then, sending the acoustic model and its input to the GPU should greatly reduce the processing time. If you use callback functions for the first part, you might run into problems if the model gets loaded onto the GPU multiple times (it could saturate GPU memory). With the two-process approach there is only one model in the other process, so it should be fine to send this model and all incoming inputs to the GPU for computation, and move the results back to the CPU afterwards.
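
    A sketch of the GPU round trip, assuming acoustic_model, data_t2d and beam_search_decoder from your code and a single model instance as described above:

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    acoustic_model = acoustic_model.to(device)  # done once, at startup

    with torch.inference_mode():
        emission, _ = acoustic_model(data_t2d.to(device))
    emission = emission.cpu()  # move back to CPU before calling the decoder
    beam_search_result = beam_search_decoder(emission)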

    About the size of the input, concatenation issues...

    Here you will have to choose, and the best way is to experiment and check the quality of the transcriptions you get. Processing longer segments at once should give better results (up to a point), but the transcription will feel less "live" and each computation will take longer. Shorter segments are faster to compute and results arrive sooner, but since some context is lost the results may be less accurate. The cutting points can also be an issue; fixing that properly is more complicated (splitting segments at word/utterance boundaries, for example). A simpler workaround is to include the last few seconds of the previous input when computing the new one, as sketched below.
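
    A sketch of that overlap workaround; the overlap length and the helper name are assumptions to tune, and acoustic_model / beam_search_decoder come from your code:

    SAMPLE_RATE = 16000
    OVERLAP_SAMPLES = 2 * SAMPLE_RATE        # keep roughly the last 2 seconds of context
    previous_tail = torch.empty(0)

    def transcribe_chunk(chunk):
        global previous_tail
        # Prepend the tail of the previous chunk so words cut at the boundary
        # are seen with some context.
        waveform = torch.cat((previous_tail, chunk)).unsqueeze(0)
        with torch.inference_mode():
            emission, _ = acoustic_model(waveform)
        previous_tail = chunk[-OVERLAP_SAMPLES:]
        result = beam_search_decoder(emission)
        return " ".join(result[0][0].words).strip()

    Note that the overlapping region gets transcribed twice, so the transcripts of consecutive chunks still need to be merged or deduplicated around the seam; the sketch only covers the signal-level part.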