javascript · socket.io · google-cloud-speech

How do I stream live audio from the browser to Google Cloud Speech via socket.io?


I have a React-based app with an input for which I want to allow voice input as well. I'm okay with making this compatible with Chrome and Firefox only, so I was thinking of using getUserMedia. I know I'll be using Google Cloud's Speech to Text API. However, I have a few caveats:

  1. I want this to stream my audio data live, not just when I'm done recording. This means that a lot of solutions I've found won't work very well, because it's not sufficient to save the file and then send it out to Google Cloud Speech.
  2. I don't want to trust my front end with my Google Cloud API credentials. Instead, I already have a service running on the back end which has my credentials, and I want to stream the audio (live) to that back end, have the back end stream it to Google Cloud, and then emit transcript updates back to the front end as they come in.
  3. I already connect to that back end service using socket.io, and I want to manage this entirely via sockets, without having to use Binary.js or anything similar.

I can't find a good tutorial anywhere on how to do this. What do I do?


Solution

  • First, credit where credit is due: a huge amount of my solution here was created by referencing vin-ni's Google-Cloud-Speech-Node-Socket-Playground project. I had to adapt this some for my React app, however, so I'm sharing a few of the changes I made.

    My solution here was composed of four parts, two on the front end and two on the back end.

    My front end solution had two parts:

    1. A utility file to access my microphone, stream audio to the back end, retrieve data from the back end, run a callback function each time that data was received from the back end, and then clean up after itself either when done streaming or when the back end threw an error.
    2. A microphone component which wrapped my React functionality.

    My back end solution had two parts:

    1. A utility file to handle the actual speech recognition stream
    2. My main.js file

    (These don't need to be separated by any means; our main.js file is just already a behemoth as it is.)

    Most of my code will just be excerpted, but my utilities will be shown in full because I had a lot of problems with all of the stages involved. My front end utility file looked like this:

    // Stream Audio
    let bufferSize = 2048,
        AudioContext,
        context,
        processor,
        input,
        globalStream;
    
    //audioStream constraints
    const constraints = {
        audio: true,
        video: false
    };
    
    let AudioStreamer = {
        /**
         * @param {function} onData Callback to run on data each time it's received
         * @param {function} onError Callback to run on an error if one is emitted.
         */
        initRecording: function(onData, onError) {
            socket.emit('startGoogleCloudStream', {
                config: {
                    encoding: 'LINEAR16',
                    sampleRateHertz: 16000,
                    languageCode: 'en-US',
                    profanityFilter: false,
                    enableWordTimeOffsets: true
                },
                interimResults: true // If you want interim results, set this to true
            }); //init socket Google Speech Connection
            AudioContext = window.AudioContext || window.webkitAudioContext;
            context = new AudioContext();
            processor = context.createScriptProcessor(bufferSize, 1, 1);
            processor.connect(context.destination);
            context.resume();
    
            var handleSuccess = function (stream) {
                globalStream = stream;
                input = context.createMediaStreamSource(stream);
                input.connect(processor);
    
                processor.onaudioprocess = function (e) {
                    microphoneProcess(e);
                };
            };
    
            navigator.mediaDevices.getUserMedia(constraints)
                .then(handleSuccess)
                .catch(function (error) {
                    // Surface microphone/permission errors rather than failing silently
                    if (onError) {
                        onError(error);
                    }
                });
    
            // Bind the data handler callback
            if(onData) {
                socket.on('speechData', (data) => {
                    onData(data);
                });
            }
    
            socket.on('googleCloudStreamError', (error) => {
                if(onError) {
                    onError('error');
                }
                // We don't want to emit another end stream event
                closeAll();
            });
        },
    
        stopRecording: function() {
            socket.emit('endGoogleCloudStream', '');
            closeAll();
        }
    }
    
    export default AudioStreamer;
    
    // Helper functions
    /**
     * Processes microphone data into a data stream
     * 
     * @param {object} e Input from the microphone
     */
    function microphoneProcess(e) {
        var left = e.inputBuffer.getChannelData(0);
        var left16 = convertFloat32ToInt16(left);
        socket.emit('binaryAudioData', left16);
    }
    
    /**
     * Converts a buffer from float32 to int16 and downsamples it by keeping
     * every third sample. Necessary for streaming: the stream config declares
     * sampleRateHertz of 16000, and this assumes the AudioContext is running
     * at 48000 Hz, so keeping one sample in three lands on that rate.
     * 
     * @param {object} buffer Buffer being converted
     */
    function convertFloat32ToInt16(buffer) {
        let l = buffer.length;
        let buf = new Int16Array(Math.ceil(l / 3));

        while (l--) {
            if (l % 3 === 0) {
                // Clamp to [-1, 1] and scale to the int16 range
                buf[l / 3] = Math.max(-1, Math.min(1, buffer[l])) * 0x7FFF;
            }
        }
        return buf.buffer;
    }
    
    /**
     * Stops recording and closes everything down. Runs on error or on stop.
     */
    function closeAll() {
        // Clear the listeners (prevents issue if opening and closing repeatedly)
        socket.off('speechData');
        socket.off('googleCloudStreamError');

        let tracks = globalStream ? globalStream.getTracks() : null;
        let track = tracks ? tracks[0] : null;
        if(track) {
            track.stop();
        }

        if(processor) {
            if(input) {
                try {
                    input.disconnect(processor);
                } catch(error) {
                    console.warn('Attempt to disconnect input failed.');
                }
            }
            processor.disconnect(context.destination);
        }
        if(context) {
            context.close().then(function () {
                input = null;
                processor = null;
                context = null;
                AudioContext = null;
            });
        }
    }
    

    The main salient point of this code (aside from the getUserMedia configuration, which was in and of itself a bit dicey) is that the onaudioprocess callback for the processor converts the audio to Int16 and emits it to the socket as binaryAudioData events; the speechData events are what come back from the back end with the transcription results. My main changes here from my linked reference above were to replace all of the functionality that directly updated the DOM with callback functions (used by my React component) and to add some error handling that wasn't included in the source.
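
    In my case the socket referenced above already existed, since I was already connected to my back end service. If you're starting from scratch, creating one with socket.io-client looks roughly like this (the module path and URL are placeholders, not part of my actual code):

    // socket.js
    import io from 'socket.io-client';

    // Point this at wherever your back end service lives
    const socket = io('http://localhost:8080');

    export default socket;

    The utility file then just needs an import of that socket at the top.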

    I was then able to access this in my React Component by just using:

    onStart() {
        this.setState({
            recording: true
        });
        if(this.props.onStart) {
            this.props.onStart();
        }
        speechToTextUtils.initRecording((data) => {
            if(this.props.onUpdate) {
                this.props.onUpdate(data);
            }   
        }, (error) => {
            console.error('Error when recording', error);
            this.setState({recording: false});
            // No further action needed, as this already closes itself on error
        });
    }
    
    onStop() {
        this.setState({recording: false});
        speechToTextUtils.stopRecording();
        if(this.props.onStop) {
            this.props.onStop();
        }
    }
    

    (I passed in my actual data handler as a prop to this component).
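
    For what it's worth, the data that handler receives is exactly what the back end emits on speechData, i.e. the raw response object from streamingRecognize. My real handler is app-specific, but a sketch of pulling the transcript out of it (the component name and state shape here are just illustrative) looks something like this:

    handleSpeechData(data) {
        const result = data.results[0];
        if (result && result.alternatives[0]) {
            const transcript = result.alternatives[0].transcript;
            if (result.isFinal) {
                // Commit the finished phrase to the input's value
                this.setState({ value: this.state.value + transcript, interim: '' });
            } else {
                // Show the in-progress transcript while the user is still talking
                this.setState({ interim: transcript });
            }
        }
    }

    // ...and in the parent's render():
    <Microphone
        onUpdate={(data) => this.handleSpeechData(data)}
        onStop={() => this.setState({ interim: '' })}
    />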

    Then on the back end, my service handled three main events in main.js:

    // Start the stream
    socket.on('startGoogleCloudStream', function(request) {
        speechToTextUtils.startRecognitionStream(socket, GCSServiceAccount, request);
    });

    // Receive audio data
    socket.on('binaryAudioData', function(data) {
        speechToTextUtils.receiveData(data);
    });

    // End the audio stream
    socket.on('endGoogleCloudStream', function() {
        speechToTextUtils.stopRecognitionStream();
    });

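
    (For context, those three handlers sit inside the usual socket.io connection callback, and GCSServiceAccount is simply however you already load your service account credentials on the server. The skeleton around the excerpt above is roughly this, with the credentials path as a placeholder:)

    const speechToTextUtils = require('./speechToTextUtils');
    // However you already load your credentials; a JSON key file is one option
    const GCSServiceAccount = require('./config/gcs-service-account.json');

    io.on('connection', function(socket) {
        // Start the stream
        socket.on('startGoogleCloudStream', function(request) {
            speechToTextUtils.startRecognitionStream(socket, GCSServiceAccount, request);
        });
        // ...binaryAudioData and endGoogleCloudStream handlers as above...
    });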

    My speechToTextUtils then looked like:

    // Google Cloud
    const speech = require('@google-cloud/speech');
    let speechClient = null;
    
    let recognizeStream = null;
    
    module.exports = {
        /**
         * @param {object} client A socket client on which to emit events
         * @param {object} GCSServiceAccount The credentials for our google cloud API access
         * @param {object} request A request object of the form expected by streamingRecognize. Variable keys and setup.
         */
        startRecognitionStream: function (client, GCSServiceAccount, request) {
            if(!speechClient) {
                speechClient = new speech.SpeechClient({
                    projectId: 'Insert your project ID here',
                    credentials: GCSServiceAccount
                }); // Creates a client
            }
            recognizeStream = speechClient.streamingRecognize(request)
                .on('error', (err) => {
                    console.error('Error when processing audio: ' + (err && err.code ? 'Code: ' + err.code + ' ' : '') + (err && err.details ? err.details : ''));
                    client.emit('googleCloudStreamError', err);
                    this.stopRecognitionStream();
                })
                .on('data', (data) => {
                    client.emit('speechData', data);
    
                    // if end of utterance, let's restart stream
                    // this is a small hack. After 65 seconds of silence, the stream will still throw an error for speech length limit
                    if (data.results[0] && data.results[0].isFinal) {
                        this.stopRecognitionStream();
                        this.startRecognitionStream(client, GCSServiceAccount, request);
                        // console.log('restarted stream serverside');
                    }
                });
        },
        /**
         * Closes the recognize stream and wipes it
         */
        stopRecognitionStream: function () {
            if (recognizeStream) {
                recognizeStream.end();
            }
            recognizeStream = null;
        },
        /**
         * Receives streaming data and writes it to the recognizeStream for transcription
         * 
         * @param {Buffer} data A section of audio data
         */
        receiveData: function (data) {
            if (recognizeStream) {
                recognizeStream.write(data);
            }
        }
    };
    

    (Again, you don't strictly need this util file, and you could certainly put the speechClient as a const on top of the file depending on how you get your credentials; this is just how I implemented it.)
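
    For example, if your credentials live in a key file on the server rather than being passed in from main.js, the top of that util file could simply construct the client once at load time. A sketch, with the key file path as a placeholder:

    // Google Cloud
    const speech = require('@google-cloud/speech');

    // Built once at module load instead of lazily inside startRecognitionStream
    const speechClient = new speech.SpeechClient({
        keyFilename: '/path/to/service-account-key.json'
    });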

    And that, finally, should be enough to get you started on this. I encourage you to do your best to understand this code before you reuse or modify it, as it may not work 'out of the box' for you, but unlike all of the other sources I found, this should at least get you started on every stage of the project. It is my hope that this answer will prevent others from suffering like I have suffered.