I am trying to add continuous speech-to-text recognition to a mobile application during a WebRTC audio-only call.
I'm using React Native on the mobile side, with the react-native-webrtc module and a custom web API for the signaling part. I have control over the web API, so I am able to add the feature on its side if that is the only solution, but I would prefer to perform it on the client side to avoid consuming bandwidth unnecessarily.
First, I worked on and tested some ideas in my laptop browser. My first idea was to use the SpeechRecognition interface from the Web Speech API: https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition
I merged the audio-only WebRTC demo with the audio visualizer demonstration in a single page, but I could not find how to connect a mediaElementSourceNode (created via AudioContext.createMediaElementSource(remoteStream) at line 44 of streamvisualizer.js) to a Web Speech API SpeechRecognition instance. In the Mozilla documentation, the audio stream seems to come with the constructor of the class, which may call the getUserMedia() API internally.
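For reference, this is roughly how SpeechRecognition is driven in Chrome (via the webkit-prefixed constructor). Note that nothing in the interface accepts a MediaStream or an audio node, which is exactly the missing connection point:

```js
// Minimal SpeechRecognition usage (Chrome's prefixed implementation).
// No argument or property accepts a MediaStream or AudioNode — the
// engine always listens to the default microphone input.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;

const recognition = new SpeechRecognition();
recognition.continuous = true;      // keep recognizing after each phrase
recognition.interimResults = true;  // emit partial hypotheses

recognition.onresult = (event) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    console.log(event.results[i][0].transcript);
  }
};

recognition.start(); // implicitly captures the microphone, not remoteStream
```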
Second, during my research I found two open source speech-to-text engines: CMUSphinx and Mozilla's DeepSpeech. The first one has a JS binding and seems to work well with the audioRecorder, which I can feed with my own mediaElementSourceNode from the first attempt (see the sketch below). However, how do I embed this in my React Native application?
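A rough sketch of that wiring, based on the pocketsphinx.js demos, where the recognizer runs in a Web Worker and the AudioRecorder helper accepts a Web Audio source node. The worker commands and the AudioRecorder API below follow those demos and should be checked against the pocketsphinx.js version you actually use:

```js
// Sketch based on the pocketsphinx.js demos (verify against your version).
const recognizer = new Worker('js/recognizer.js'); // pocketsphinx.js worker
recognizer.postMessage({ command: 'initialize' });

recognizer.onmessage = (e) => {
  if (e.data.hyp) console.log('hypothesis:', e.data.hyp);
};

const audioCtx = new AudioContext();
// For a remote WebRTC MediaStream, createMediaStreamSource is the right
// factory (createMediaElementSource expects an <audio>/<video> element).
const source = audioCtx.createMediaStreamSource(remoteStream);

const recorder = new AudioRecorder(source); // from pocketsphinx.js's audioRecorder.js
recorder.consumers = [recognizer];          // recorder posts audio chunks to the worker
recorder.start();
```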
There are also native Android and iOS WebRTC modules, which I might be able to connect with the CMUSphinx platform-specific bindings (iOS, Android), but I don't know about native class interoperability. Can you help me with that?
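For context, bridging native recognition into React Native would look roughly like this on the JS side. SpeechRecognizerModule and its methods are hypothetical placeholder names for a module you would have to implement natively on both platforms, not an existing library:

```js
// Hypothetical JS interface to a native module wrapping CMUSphinx
// (the module name and methods below are placeholders).
import { NativeModules, NativeEventEmitter } from 'react-native';

const { SpeechRecognizerModule } = NativeModules;
const emitter = new NativeEventEmitter(SpeechRecognizerModule);

emitter.addListener('onPartialResult', ({ transcript }) => {
  console.log('partial:', transcript);
});

// The hard part is on the native side: react-native-webrtc does not expose
// raw audio frames to JS, so the native module would have to tap the audio
// track inside the native WebRTC stack and feed it to the recognizer there.
SpeechRecognizerModule.start();
```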
I haven't created any "grammar" or defined any "hot words" yet, because I am not sure which technologies will be involved, but I can do that later once I am able to connect a speech recognition engine to my audio stream.
You need to stream the audio to an ASR server, either by adding another WebRTC party to the call or over some other protocol (TCP/WebSocket/etc.). On the server you perform the recognition and send the results back.
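A minimal sketch of the WebSocket variant, assuming a server at wss://asr.example.com/stt that accepts raw 16-bit PCM and returns JSON transcripts (both the URL and the wire format are assumptions to adapt to your server):

```js
// Sketch: shipping the remote call audio to an ASR server over WebSocket.
const socket = new WebSocket('wss://asr.example.com/stt'); // assumed endpoint
socket.binaryType = 'arraybuffer';

const audioCtx = new AudioContext();
const source = audioCtx.createMediaStreamSource(remoteStream);
const processor = audioCtx.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (e) => {
  if (socket.readyState !== WebSocket.OPEN) return;
  const float32 = e.inputBuffer.getChannelData(0);
  // Convert to 16-bit PCM; samples arrive at audioCtx.sampleRate (often
  // 44.1/48 kHz), so resample here or server-side if the recognizer
  // expects 16 kHz.
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    int16[i] = Math.max(-1, Math.min(1, float32[i])) * 0x7fff;
  }
  socket.send(int16.buffer);
};

socket.onmessage = (msg) => {
  console.log('transcript:', JSON.parse(msg.data).text); // assumed response shape
};

source.connect(processor);
processor.connect(audioCtx.destination); // ScriptProcessor needs a sink to fire
```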
First, I worked on and tested some ideas in my laptop browser. My first idea was to use the SpeechRecognition interface from the Web Speech API: https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition
This is experimental and does not really work in Firefox. In Chrome it only takes microphone input directly, not the two streams from the caller and the callee.
The first one has a JS binding and seems to work well with the audioRecorder, which I can feed with my own mediaElementSourceNode from the first attempt.
You will not be able to run this as local recognition inside your React Native app.