I'm trying to transcribe Telegram audio messages using Mozilla's speech-to-text engine DeepSpeech.
Using *.wav in 16-bit, 16 kHz works flawlessly.
I want to add *.ogg Opus support, since Telegram uses this format for its audio messages.
I have tried pyogg and soundfile so far, with no luck.
soundfile outright could not read the Opus format, and pyogg is a pain to install without conda. I had really weird moments where it literally crashed Python.
Right now, I'm trying librosa with mixed results.
data, sample_rate = librosa.load(path)
tmp = np.array(data, np.float16)
tmp.dtype = np.int16
int16 = np.array(tmp, dtype=np.int16)
metadata = model.sttWithMetadata(int16)
DeepSpeech really likes np.int16. model.sttWithMetadata is essentially the call for the transcriber.
Right now, it does transcribe something, but nowhere near anything resembling what I speak in my audio message.
librosa returns an array of floats in the range -1.0 to 1.0. In int16 the maximum value is 32767, so you have to multiply to scale the signal and then convert to int16.
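# sr=16000 resamples to the 16 kHz rate that worked for the WAV input (librosa defaults to 22050 Hz)
data, sample_rate = librosa.load(path, sr=16000)
# Scale floats in [-1.0, 1.0] to the int16 range, then convert
int16 = (data * 32767).astype(np.int16)
metadata = model.sttWithMetadata(int16)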
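To see why the attempt in the question produced unrelated output, here is a small sketch on a synthetic signal (no audio file or model needed). The sample values are made up for illustration, and `.view(np.int16)` is used as a stand-in for assigning `tmp.dtype = np.int16`, on the assumption that both reinterpret the same raw bytes.

```python
import numpy as np

# Synthetic stand-in for librosa output: float32 samples in [-1.0, 1.0]
data = np.array([0.0, 0.5, -0.5, 1.0, -1.0], dtype=np.float32)

# Approach from the question: cast to float16, then reinterpret the raw
# bits as int16. No numeric conversion happens, only a relabeling of bytes.
reinterpreted = data.astype(np.float16).view(np.int16)

# Approach from the answer: scale to the int16 range, then convert.
scaled = (data * 32767).astype(np.int16)

print(reinterpreted)  # bit patterns of the float16 values
print(scaled)         # amplitudes proportional to the original signal
```

The reinterpreted values are just the float16 bit patterns, so they bear no linear relation to the waveform, which would explain a transcription that resembles nothing you said.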
Quick explanation of why 32767:
In 16-bit computing, an integer can store 2^16 (65,536) distinct values.
That means unsigned integers can range from 0 to 65,535, and the two's complement representation from -32,768 to 32,767. It also means a processor with 16-bit memory addresses can access 64 KB (64 * 1024 = 65,536 unique addresses) of memory at a time.
If our float array has values ranging from -1.0 to 1.0, we scale the signal by a factor of 32,767 to fill the 16-bit integer range your DeepSpeech model expects.
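The range above can be checked with NumPy directly. The sketch below also wraps the scaling in a small helper; `float_to_int16` is a hypothetical name, and the clipping step is an added precaution (not part of the answer above) in case resampling pushes a few samples slightly outside -1.0 to 1.0.

```python
import numpy as np

# Query the int16 limits instead of hard-coding them
info = np.iinfo(np.int16)
print(info.min, info.max)

def float_to_int16(samples):
    """Hypothetical helper: scale floats in [-1.0, 1.0] to int16.

    Values outside the range are clipped first so the cast cannot overflow.
    """
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * info.max).astype(np.int16)

# Made-up samples, including two that exceed the nominal range
pcm = float_to_int16(np.array([-1.2, -1.0, 0.0, 0.25, 1.0, 1.2]))
print(pcm)
```

Multiplying by 32,767 rather than 32,768 keeps +1.0 inside the representable range, at the cost of never using -32,768.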