speech-recognition speech-to-text mozilla-deepspeech

Get alternative suggestions during speech recognition


I would like to use offline speech-to-text recognition, mostly for German.

In particular, I want to use Mozilla DeepSpeech (a TensorFlow implementation of Baidu's DeepSpeech architecture), but I fear that the quality of my audio input is not good enough to yield a low word error rate (WER).

(English) example:

The speaker said "know" but the engine might have understood "flow" or "show" or "go" or "know".

I would like to get [flow, show, go, know] back from the engine so that I can decide manually afterwards which suggestion fits best. How can I get this?

Do other speech-to-text engines offer this possibility?


Solution

  • DeepSpeech publishes updated releases. For better inference results, follow its instructions and suggestions: the input audio file should be 16000 Hz, mono, and 16-bit. Resampling can affect the quality of inference, so keep this in mind. I personally use SoX for resampling, but there are other options, such as samplerate. There are also many good suggestions on the DeepSpeech forum.

    There is also a Python library called SpeechRecognition. It supports some offline engines as well as online API services for speech-to-text.
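As a rough sketch of how this could look in practice: the first helper below checks that a WAV file matches the 16000 Hz, mono, 16-bit format mentioned above, and the second asks DeepSpeech for several candidate transcripts instead of only the best one. This assumes DeepSpeech 0.7 or later, where `Model.sttWithMetadata(audio_buffer, num_results)` is available. The function names `check_wav_format`, `candidate_transcripts`, and `main` are my own, not part of any library. Note that the candidates are alternatives for the whole utterance rather than per-word suggestions, and the confidence values are only meaningful relative to each other (higher is better).

```python
import wave
from typing import List, Tuple


def check_wav_format(path: str) -> List[str]:
    """Return a list of problems; an empty list means 16 kHz, mono, 16-bit."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected 1 (mono)")
        if wav.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append(f"{8 * wav.getsampwidth()}-bit samples, expected 16-bit")
    return problems


def candidate_transcripts(model, audio, num_results: int = 5) -> List[Tuple[str, float]]:
    """Return up to num_results (text, confidence) pairs from the decoder.

    `model` is expected to behave like deepspeech.Model (version 0.7+).
    """
    metadata = model.sttWithMetadata(audio, num_results)
    return [
        # Each candidate is a sequence of character tokens; join them into text.
        ("".join(token.text for token in transcript.tokens), transcript.confidence)
        for transcript in metadata.transcripts
    ]


def main(model_path: str, wav_path: str) -> None:
    # Imported here so the helpers above work without these packages installed.
    import numpy as np
    from deepspeech import Model

    problems = check_wav_format(wav_path)
    if problems:
        raise SystemExit("unsuitable audio: " + "; ".join(problems))

    # DeepSpeech expects the raw samples as a 16-bit integer array.
    with wave.open(wav_path, "rb") as wav:
        audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

    model = Model(model_path)
    for text, confidence in candidate_transcripts(model, audio):
        print(f"{confidence:10.3f}  {text}")
```

Calling `main("model.pbmm", "speech.wav")` (both file names are placeholders) would print one candidate per line, from which you can pick the best fit by hand. If the format check reports problems, resample the file first, for example with SoX.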