I'm setting up a project with Microsoft Speech to Text. It works fine, and I'm able to transcribe what I say into text and send it later to other SignalR subscribers.
However, I now need to combine it with Speaker Recognition. In other words, I want my speech to text to recognize only a few specific speakers.
Currently I use the classic TranslationRecognizer class, which gets the default microphone and translates on the fly.
I then call the StartContinuousRecognitionAsync method to start recognition.
Is there a way to intercept the audio stream before it is sent to the translation service, check that the speaker is the right one, and then, once verification succeeds, resume the standard execution?
I assume this would be the best idea, but I'm open to any idea or architecture change.
Thanks for your input
Thanks for reaching out! Currently, speaker diarization (i.e. who is speaking) is only available in our batch transcription service, not yet for real-time speech recognition. However, if you are able to separate speakers yourself, e.g. based on audio channel, you can feed the audio stream for a particular speaker to the Speech SDK for recognition via the AudioInputStream interface.
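To illustrate the idea of checking the speaker yourself before the audio reaches the SDK, here is a minimal Python sketch of a gate that sits between the microphone capture and the recognizer's input stream. Everything in it is an assumption made for illustration: the `SpeakerGate` class, the `verify` callback (a placeholder for whatever speaker check you implement or call) and the `sink` callback are hypothetical names, not part of the Speech SDK. The default of 32000 bytes corresponds to one second of 16 kHz, 16-bit mono PCM.

```python
from collections import deque
from typing import Callable, Deque


class SpeakerGate:
    """Holds audio back until a speaker check passes, then forwards it.

    `verify` is a hypothetical callback that returns True when the buffered
    audio belongs to an allowed speaker. `sink` receives the audio that is
    let through (in a real setup, the write method of a push audio stream).
    """

    def __init__(self, verify: Callable[[bytes], bool],
                 sink: Callable[[bytes], None],
                 min_bytes: int = 32000) -> None:
        self._verify = verify
        self._sink = sink
        self._min_bytes = min_bytes          # audio needed before checking
        self._pending: Deque[bytes] = deque()
        self._pending_size = 0
        self.verified = False

    def feed(self, chunk: bytes) -> None:
        # Once the speaker is verified, pass audio straight through.
        if self.verified:
            self._sink(chunk)
            return

        # Otherwise keep buffering until there is enough audio to check.
        self._pending.append(chunk)
        self._pending_size += len(chunk)
        if self._pending_size < self._min_bytes:
            return

        sample = b"".join(self._pending)
        if self._verify(sample):
            self.verified = True
            self._sink(sample)               # release the buffered audio
        # On failure the buffered audio is dropped and buffering restarts.
        self._pending.clear()
        self._pending_size = 0


if __name__ == "__main__":
    forwarded = []
    # Toy check for the demo only: accept audio that starts with b"ok".
    gate = SpeakerGate(lambda audio: audio.startswith(b"ok"),
                       forwarded.append, min_bytes=4)
    for piece in (b"ok", b"ay", b"more"):
        gate.feed(piece)
    print(gate.verified, len(forwarded))
```

In a real application the `sink` would most likely be the write method of the SDK's push audio input stream, with the recognizer created from an audio config built on that stream instead of the default microphone. The same pattern should carry over to the C# SDK. Note that the gate stays open once verification succeeds, so you may want to reset `verified` after a silence or a timeout.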
Thanks.