I'm learning about Mozilla's DeepSpeech Speech-To-Text engine. I had no trouble getting the command line interface working, but the Python interface seems to be behaving differently. When I run:
deepspeech --model models/output_graph.pb --alphabet models/alphabet.txt --audio testFile3.wav
on a PCM, 16-bit, mono, 48000 Hz .wav file generated with sox, I get the following:
test test apple benana
Aside from "benana" (it should have been "banana"), it seems to work fine, as do the other files I've tested it on. The problem comes when I try to use the following code, which comes from this tutorial:
import deepspeech
import scipy.io.wavfile as wav
import sys

ds = deepspeech.Model(sys.argv[1], 26, 9, sys.argv[2], 500)
fs, audio = wav.read(sys.argv[3])
processed_data = ds.stt(audio, fs)
print(processed_data)
I run the code with the following command:
python3 -Bi test.py models/output_graph.pb models/alphabet.txt testFile3.wav
Depending on the specific file, I get different four-character responses. The response I got from this particular file was 'hahm', but 'hmhm' and ' eo' are also common. Changing the parameters to the model (the 26, 9, and 500) doesn't seem to change the output.
Just include your trie and lm.binary files and try again:
from deepspeech import Model
import scipy.io.wavfile

# Decoder and feature parameters
BEAM_WIDTH = 500
LM_WEIGHT = 1.50
VALID_WORD_COUNT_WEIGHT = 2.25
N_FEATURES = 26
N_CONTEXT = 9

# Model, alphabet, language model, and trie paths
MODEL_FILE = 'output_graph.pbmm'
ALPHABET_FILE = 'alphabet.txt'
LANGUAGE_MODEL = 'lm.binary'
TRIE_FILE = 'trie'

ds = Model(MODEL_FILE, N_FEATURES, N_CONTEXT, ALPHABET_FILE, BEAM_WIDTH)
ds.enableDecoderWithLM(ALPHABET_FILE, LANGUAGE_MODEL, TRIE_FILE, LM_WEIGHT,
                       VALID_WORD_COUNT_WEIGHT)

def process(path):
    fs, audio = scipy.io.wavfile.read(path)
    processed_data = ds.stt(audio, fs)
    return processed_data

print(process('sample.wav'))
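If you want to keep calling the script the way you do in the question, a minimal sketch could take the paths from sys.argv instead of hard-coding them. The argument order below (model, alphabet, lm.binary, trie, wav) is my own choice, not something from the question:

# Hypothetical command-line variant of the script above, e.g.:
#   python3 test.py models/output_graph.pb models/alphabet.txt models/lm.binary models/trie testFile3.wav
import sys

from deepspeech import Model
import scipy.io.wavfile

BEAM_WIDTH = 500
LM_WEIGHT = 1.50
VALID_WORD_COUNT_WEIGHT = 2.25
N_FEATURES = 26
N_CONTEXT = 9

# Assumed argument order: model, alphabet, language model, trie, audio file
model_path, alphabet_path, lm_path, trie_path, wav_path = sys.argv[1:6]

ds = Model(model_path, N_FEATURES, N_CONTEXT, alphabet_path, BEAM_WIDTH)
ds.enableDecoderWithLM(alphabet_path, lm_path, trie_path, LM_WEIGHT,
                       VALID_WORD_COUNT_WEIGHT)

fs, audio = scipy.io.wavfile.read(wav_path)
print(ds.stt(audio, fs))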
This might still produce the same response; use the same audio file for both inferences and verify. The audio files should be 16-bit, 16000 Hz, mono recordings.
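Since the file in the question is 48000 Hz, one way to check and convert it is a small sketch like the following. The resample_to_16k helper name and the use of scipy.signal.resample_poly are my own choices, not part of the question or answer (sox, which generated the file, could equally well write it out at 16000 Hz directly):

# Sketch: check a WAV file and resample it to 16 kHz / 16-bit / mono.
# The helper name and the resample_poly approach are assumptions.
import numpy as np
import scipy.io.wavfile
import scipy.signal

def resample_to_16k(in_path, out_path, target_fs=16000):
    fs, audio = scipy.io.wavfile.read(in_path)
    print('input rate: %d Hz, dtype: %s, shape: %s' % (fs, audio.dtype, audio.shape))

    # Mix stereo down to mono if necessary
    if audio.ndim > 1:
        audio = audio.mean(axis=1)

    # Resample (48000 -> 16000 is a clean 3:1 ratio)
    if fs != target_fs:
        g = np.gcd(fs, target_fs)
        audio = scipy.signal.resample_poly(audio.astype(np.float32),
                                           target_fs // g, fs // g)
        fs = target_fs

    # Write back out as 16-bit PCM
    scipy.io.wavfile.write(out_path, fs, audio.astype(np.int16))

resample_to_16k('testFile3.wav', 'testFile3_16k.wav')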