python python-telegram-bot ogg librosa mozilla-deepspeech

How to decode .ogg opus to int16 NumPy array with librosa?

What I'm trying to do

I'm trying to transcribe Telegram audio messages using Mozilla's speech-to-text engine, DeepSpeech.

Using *.wav files in 16-bit, 16 kHz format works flawlessly.

I want to add *.ogg Opus support, since Telegram uses this format for its audio messages.

What I have tried so far

I have tried pyogg and soundfile so far, with no luck.

Soundfile could not read the Opus format at all, and pyogg is a pain to install without conda. I had really weird moments where it literally crashed Python.

Right now, I'm trying librosa with mixed results.

data, sample_rate = librosa.load(path)

tmp = np.array(data, np.float16)

tmp.dtype = np.int16

int16 = np.array(tmp, dtype=np.int16)

metadata = model.sttWithMetadata(int16)

DeepSpeech really likes np.int16. model.sttWithMetadata is essentially the call to the transcriber.

Right now, it does transcribe something, but the output is nowhere near what I actually say in my audio message.

Solution

librosa returns an array of floats in the range -1.0 to 1.0. The maximum int16 value is 32767, so you have to multiply to scale the signal and then convert to int16.

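# librosa.load resamples to 22050 Hz by default; ask for 16 kHz to match the WAV input that already works
data, sample_rate = librosa.load(path, sr=16000)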

int16 = (data * 32767).astype(np.int16)

metadata = model.sttWithMetadata(int16)
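A slightly more defensive variant of the conversion above, as a rough sketch: it clips to [-1.0, 1.0] before scaling, so a sample that lands a little outside that range saturates instead of wrapping around to the opposite sign. The helper name `float_to_int16` is made up for illustration and is not part of librosa or DeepSpeech, and the `demo` array is synthetic data standing in for what `librosa.load` returns.

```python
import numpy as np


def float_to_int16(samples):
    """Convert float audio in [-1.0, 1.0] to signed 16-bit PCM samples."""
    # Clip first so out-of-range values saturate instead of wrapping around.
    clipped = np.clip(np.asarray(samples, dtype=np.float32), -1.0, 1.0)
    # Scale to the int16 range; astype truncates the fractional part toward zero.
    return (clipped * 32767).astype(np.int16)


# Synthetic stand-in for the float array returned by librosa.load (assumption).
demo = np.array([0.0, 0.5, -0.5, 1.0, -1.0, 1.2], dtype=np.float32)
pcm = float_to_int16(demo)
print(pcm.dtype, pcm.tolist())
```

With real audio you would pass `data` from `librosa.load` instead of `demo` and hand the result to `model.sttWithMetadata`.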

Quick explanation why 32767:

In 16-bit computing, an integer can store 2^16 (65,536) distinct values.

That means unsigned integers can range from 0 to 65,535, and the two's complement representation from -32,768 to 32,767. It also means a processor with 16-bit memory addresses can access 64 KB (64 * 1024 = 65,536 unique addresses) of memory at a time.

If our float array has values ranging from -1.0 to 1.0, we scale the signal by a factor of 32,767 so that it spans the signed 16-bit integer range your DeepSpeech model expects.
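The arithmetic above can be checked with a few lines of standard-library Python. This is only an illustrative sketch; the names `fits_int16`, `scaled_peak` and `scaled_trough` are invented for the example.

```python
import struct

BITS = 16
distinct_values = 2 ** BITS        # number of bit patterns in 16 bits
int16_min = -(2 ** (BITS - 1))     # lowest two's complement value
int16_max = 2 ** (BITS - 1) - 1    # highest two's complement value


def fits_int16(value):
    """Return True if value can be packed as a signed 16-bit integer."""
    try:
        struct.pack("<h", value)   # "h" is the signed 16-bit format code
        return True
    except struct.error:
        return False


# Scaling the extremes of a [-1.0, 1.0] float signal by 32767.
scaled_peak = int(1.0 * int16_max)
scaled_trough = int(-1.0 * int16_max)

print(distinct_values, int16_min, int16_max)
print(fits_int16(scaled_peak), fits_int16(scaled_trough), fits_int16(32768))
```

Scaling by 32,767 rather than 32,768 keeps +1.0 inside the representable range; the cost is that -32,768 is never produced.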