Can the Microsoft Bing Speech API be configured to only return numbers and letters, as opposed to full words?
The use case is transcribing spoken Canadian postal codes, e.g. "M 1 B 0 R 3", for which Microsoft may return "Em 1 Be 0 Are 3".
Our audio file is sampled at 8000 Hz and encoded as μ-law ("M-ULAW"). We have no flexibility in changing the sample rate or encoding. We are using the "SMD" scenario, but I can't find any documentation on what it does. Base request URI:
https://speech.platform.bing.com/recognize?scenarios=smd&appid=D4D52672-91D7-4C74-8AD8-42B1D98141A5&device.os=your_device_os&version=3.0
Is there a way to get a more accurate response from Microsoft for this use case?
Thank you
You could try using Microsoft's Custom Speech Service (previously known as the Custom Recognition Intelligent Service, or CRIS) to create and use a custom language model.
The guidelines for transcription of custom language models say "Common acronyms can be left as a single entity without periods or spaces between the letters, but all other acronyms should be written out in separate letters, with each letter separated by a single space" and include this example:
Original text            After normalization
-----------------------  ---------------------------
play OU812 by Van Halen  play O U 8 1 2 by Van Halen
So following their guidelines, the text data you use to build your custom language model will be a file where each line looks something like this:
M 1 B 0 R 3
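If you already have a list of real postal codes in their usual written form (such as "M1B 0R3"), a minimal sketch for converting them to this one-character-per-token layout could look like the following. The function name `to_spoken_form` is hypothetical, and the sketch assumes the only separators to drop are spaces and hyphens.

```python
def to_spoken_form(postal_code: str) -> str:
    """Return the code with a single space between each letter and digit."""
    # Upper-case the input and keep only letters and digits,
    # which drops any spaces or hyphens in the written form.
    chars = [c for c in postal_code.upper() if c.isalnum()]
    return " ".join(chars)


print(to_spoken_form("m1b 0r3"))  # M 1 B 0 R 3
```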
You can easily generate a file containing thousands of examples of Canadian postal codes based on the structure of the codes, which in regular expression format looks like this:
[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ][0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]
(The above expression is taken from this answer about validating postal codes.)
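As a rough sketch of how such a file could be generated, the Python below picks random characters for each position according to the character classes in the expression above and writes one space-separated code per line. The function names, the output file name, and the sample count are assumptions for illustration, not anything required by the Custom Speech Service.

```python
import random

# Character classes copied from the regular expression above.
FIRST_LETTERS = "ABCEGHJKLMNPRSTVXY"    # allowed in position 1
OTHER_LETTERS = "ABCEGHJKLMNPRSTVWXYZ"  # allowed in positions 3 and 5
DIGITS = "0123456789"                   # allowed in positions 2, 4 and 6


def random_postal_code(rng: random.Random) -> str:
    """Build one structurally valid code, with a space between characters."""
    chars = [
        rng.choice(FIRST_LETTERS), rng.choice(DIGITS),
        rng.choice(OTHER_LETTERS), rng.choice(DIGITS),
        rng.choice(OTHER_LETTERS), rng.choice(DIGITS),
    ]
    return " ".join(chars)


def generate_corpus(count: int, seed: int = 0) -> list:
    """Return `count` distinct codes; a fixed seed makes the run repeatable."""
    rng = random.Random(seed)
    codes = set()
    while len(codes) < count:
        codes.add(random_postal_code(rng))
    return sorted(codes)


def write_corpus(path: str, count: int) -> None:
    """Write one code per line to a plain-text file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(generate_corpus(count)) + "\n")
```

Calling `write_corpus("postal_codes.txt", 5000)` would produce a file of 5000 lines such as `M 1 B 0 R 3`. Note that this only follows the structure of the codes, so some generated lines will be codes that are not actually in use.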
By doing this you're telling the recognizer what sort of things you're expecting people to say, and helping it choose when there are multiple possibilities for a sound (e.g. "U" vs. "you"). I think it will make a huge difference in the results you get.