I am using Google's Cloud Speech-to-Text API to transcribe speech in real time. It works well; however, for the first 5-6 seconds after creating a streamingRecognize stream and feeding it audio input, it doesn't send any response. After that initial delay I get bombarded with a quick succession of interim responses showing that the audio was processed in its entirety. Apart from that, there are no further delays if you continue speaking, and the responses are close to real time.
Normally I would say the delay is the connection being established or some configuration being done, but there is no such initial delay when testing Google's streaming example from here (Perform streaming speech recognition on an audio stream).
I am using a JavaScript client that records audio with RecordRTC and sends the blobs over socket.io sockets to a Node.js server.
The code snippet that deals with the stream:
const speech = require('@google-cloud/speech');
const client = new speech.SpeechClient();

// ...

socket.on("init", (config) => {
    recognizeStream = client
        .streamingRecognize({
            config: config,
            interimResults: true
        })
        .on('error', console.error)
        .on('data', (data) => {
            if (data.results[0] && data.results[0].alternatives[0]) {
                resHandler(data, socket);
            } else {
                LOG.warning(`No data`);
            }
        });

    socket.once("microphone_blob", (firstBlob) => {
        // Change the .wav header:
        // set the file size to 2147479588 ...
        firstBlob[4] = 0x24;
        firstBlob[5] = 0xf0;
        firstBlob[6] = 0xff;
        firstBlob[7] = 0x7f;
        // ... and the data size to 2147479552, aka size minus header
        firstBlob[40] = 0x00;
        firstBlob[41] = 0xf0;
        firstBlob[42] = 0xff;
        firstBlob[43] = 0x7f;
        recognizeStream.write(firstBlob);

        socket.on("microphone_blob", (blob) => {
            // Cut out the header
            recognizeStream.write(blob.slice(44));
        });
    });
});
The blobs are in .wav format and each blob carries its own header. On the first received blob I patch the file size and data size in the header so that the stream keeps accepting the following blobs. After that I strip the 44-byte header from each subsequent blob, since it would otherwise be decoded as noise.
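The header patching above can be generalized into a small helper (the `patchWavHeader` name is my own; the offsets are the standard positions of the RIFF chunk size and data chunk size in a canonical 44-byte PCM WAV header, both little-endian):

```javascript
// Patch a 44-byte PCM WAV header so the RIFF and data chunk sizes
// report near-maximum values, letting the decoder treat the stream
// as open-ended. Standard WAV layout:
//   bytes 4-7:   RIFF chunk size (little-endian uint32)
//   bytes 40-43: data chunk size (little-endian uint32)
function patchWavHeader(buffer) {
    const patched = Buffer.from(buffer);
    patched.writeUInt32LE(0x7ffff024, 4);  // 2147479588 = data size + 36
    patched.writeUInt32LE(0x7ffff000, 40); // 2147479552
    return patched;
}
```

This writes the same byte values as the index assignments in the snippet, but makes the RIFF-size-equals-data-size-plus-36 relationship explicit.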
The recognizeStream config and RecordRTC config:
let sampleRateHertz = 48000;

// ...

socket.emit("init", {
    encoding: "WEBM_OPUS",
    languageCode: "en-US",
    sampleRateHertz: sampleRateHertz,
    audioChannelCount: 1,
});

recordAudio = RecordRTC(stream, {
    type: "audio",
    mimeType: "audio/webm",
    sampleRate: sampleRateHertz,
    recorderType: StereoAudioRecorder,
    numberOfAudioChannels: 1,
    timeSlice: 100,
    // fires as soon as a slice of the stream is available
    ondataavailable: function (blob) {
        socket.emit("microphone_blob", blob);
    },
});
The only differences between Google's example and my code that I could find are the encoding ("WEBM_OPUS" vs "LINEAR16") and the sampleRateHertz (48 kHz vs 16 kHz, and implicitly the header's sample rate and byte rate). The socket communication has an insignificant delay, so the fact that this is a client-server app versus the backend-only example should not be a problem. I load my Google credentials the same way (from a .env file), so that shouldn't be a problem either.
I have no idea what could cause this initial delay. Is it the sample rate, the encoding, or something else I missed entirely?
In the end I found the solution: it was indeed the encoding causing the delay. Changing "WEBM_OPUS" to "LINEAR16" fixed it, and now the delay is gone. I had tried swapping it once before and recognition stopped working completely, so I didn't try again; I don't know whether something else changed in the meantime or I simply made a mistake the first time.
I will leave the question up in case anyone else needs this.
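For anyone landing here: the blobs produced above are WAV-wrapped PCM (that is why patching the WAV header works at all), so a "LINEAR16" config matches the actual payload where "WEBM_OPUS" did not. A sketch of the config that removed the delay, keeping the other fields from the question unchanged:

```javascript
// Streaming config without the initial delay: raw 16-bit PCM instead
// of WEBM_OPUS. The audio really is WAV/PCM, so only the 44-byte .wav
// header needs stripping, as in the original code.
const config = {
    encoding: "LINEAR16",      // was "WEBM_OPUS"
    languageCode: "en-US",
    sampleRateHertz: 48000,    // must match what RecordRTC captures
    audioChannelCount: 1,
};
```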