react-native, webrtc, web-audio-api, speech-to-text, cmusphinx

How to perform continuous speech-to-text on a WebRTC call audio stream in a mobile app


I am trying to add continuous speech-to-text recognition to a mobile application during a WebRTC audio-only call.

I'm using React Native on the mobile side, with the react-native-webrtc module and a custom web API for the signaling part. I have control over the web API, so I am able to add the feature on its side if that is the only solution, but I would prefer to perform it on the client side to avoid consuming extra bandwidth if there is no need.

First, I worked on and tested some ideas in my laptop browser. My first idea was to use the SpeechRecognition interface from the Web Speech API: https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition

I merged the audio-only WebRTC demo with the audio visualiser demonstration in one page, but I could not find a way to connect a MediaElementAudioSourceNode (created via AudioContext.createMediaElementSource(remoteStream) at line 44 of streamvisualizer.js) to a Web Speech API SpeechRecognition instance. In the Mozilla documentation, the audio stream seems to be set up when the class is constructed, which apparently calls the getUserMedia() API internally.
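For illustration, here is roughly the setup I experimented with (a minimal sketch, not my exact code: remoteStream is assumed to come from the peer connection, and I use createMediaStreamSource here since the source is a MediaStream). The visualiser side works, but SpeechRecognition exposes no way to accept an AudioNode or MediaStream:

```javascript
// Visualiser side: the remote WebRTC stream feeds an AnalyserNode (this part works).
const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
const sourceNode = audioCtx.createMediaStreamSource(remoteStream); // remote stream from the RTCPeerConnection
const analyser = audioCtx.createAnalyser();
sourceNode.connect(analyser);
analyser.connect(audioCtx.destination);

// Recognition side: SpeechRecognition has no input parameter; it implicitly
// captures the default microphone when start() is called.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.continuous = true;
recognition.interimResults = true;
recognition.onresult = (event) => {
  const latest = event.results[event.results.length - 1];
  console.log(latest[0].transcript);
};
recognition.start(); // listens to the microphone, not to sourceNode
```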

Second, during my research I found two open-source speech-to-text engines: CMUSphinx and Mozilla's DeepSpeech. The first one has a JS binding and looks promising with its audioRecorder, which I can feed with my own MediaElementAudioSourceNode from the first attempt. However, how can I embed this in my React Native application?
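As a rough sketch of what I mean by feeding the recorder (feedToRecognizer is just a placeholder for the engine's recorder/worker, and resampling to the rate the engine expects, typically 16 kHz mono, is omitted):

```javascript
// Tap raw PCM frames from the remote stream; this is the kind of data a
// recognizer's audio recorder consumes.
const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
const sourceNode = audioCtx.createMediaStreamSource(remoteStream);
const processor = audioCtx.createScriptProcessor(4096, 1, 1); // deprecated, but widely supported
sourceNode.connect(processor);
processor.connect(audioCtx.destination);

processor.onaudioprocess = (event) => {
  const pcm = event.inputBuffer.getChannelData(0); // Float32Array at audioCtx.sampleRate
  feedToRecognizer(pcm); // placeholder: hand the samples to the engine's recorder/worker
};
```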

There are also native Android and iOS WebRTC modules, which I might be able to connect with CMUSphinx's platform-specific bindings (iOS, Android), but I don't know how well the native classes interoperate. Can you help me with that?

I haven't created any "grammar" or defined any "hot words" yet, because I am not sure which technologies are involved, but I can do that later once I am able to connect a speech recognition engine to my audio stream.


Solution

  • You need to stream the audio to an ASR server, either by adding another WebRTC party to the call or via some other protocol (TCP/WebSocket/etc.). On the server you perform the recognition and send the results back; see the sketch at the end of this answer.

    First, I worked on and tested some ideas in my laptop browser. My first idea was to use the SpeechRecognition interface from the Web Speech API: https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition

    This is experimental and does not really work in Firefox. In Chrome it only takes microphone input directly, not a dual stream from the caller and the callee.

    The first one has a JS binding and looks promising with its audioRecorder, which I can feed with my own MediaElementAudioSourceNode from the first attempt.

    You will not be able to run this as local recognition inside your React Native app.
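A minimal browser-side sketch of the WebSocket approach, assuming remoteStream is the remote MediaStream from the RTCPeerConnection; the endpoint URL and the message format are placeholders, since the real protocol depends on the ASR server you put behind it:

```javascript
// Capture the remote call audio with the Web Audio API, convert it to 16-bit
// PCM and stream it to an ASR server over a WebSocket.
const socket = new WebSocket('wss://asr.example.com/stream'); // placeholder endpoint
socket.binaryType = 'arraybuffer';

const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
const source = audioCtx.createMediaStreamSource(remoteStream);
const processor = audioCtx.createScriptProcessor(4096, 1, 1);
source.connect(processor);
processor.connect(audioCtx.destination);

processor.onaudioprocess = (event) => {
  const float32 = event.inputBuffer.getChannelData(0);
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    // Clamp and scale [-1, 1] float samples to 16-bit signed integers.
    const s = Math.max(-1, Math.min(1, float32[i]));
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  // NOTE: resampling to the rate the server expects (often 16 kHz) is omitted for brevity.
  if (socket.readyState === WebSocket.OPEN) {
    socket.send(int16.buffer);
  }
};

socket.onmessage = (event) => {
  console.log('transcript:', event.data); // server sends recognition results back
};
```

Note that this sketch relies on the Web Audio API, which React Native itself does not provide, so on mobile the equivalent capture would have to happen in native code or by routing the audio through the server, which is another argument for doing the recognition server-side.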