machine-learning, speech-recognition, speech-to-text, ctc, google-speech-to-text-api

What do confidence scores mean in speech recognition?


A lot of speech-to-text services (such as Google's) provide a confidence score. At least for Google it is between 0 and 1, but it is clearly not the probability that a particular transcription is correct, as confidences for alternative transcriptions add up to more than 1. Also, a higher-confidence result is sometimes ranked lower than a lower-confidence one.

So, what is it? Is there a recognized meaning of 'confidence score' in the speech recognition community? I have seen references to minimum Bayes risk, but even if that is what they are doing, it doesn't fully answer the question, since the result depends on the choice of auxiliary loss function.


Solution

  • but is clearly not the probability that a particular transcription is correct, as confidences for alternative transcriptions add up to more than 1

    Statistical algorithms never give you the true probability; they give you estimates. An estimate may be inaccurate in individual cases; the idea is that, on average, the estimates approach the ideal. To be interpretable as probabilities, confidence scores have to be calibrated. You can find some of the theory in the paper below, and a minimal calibration sketch follows the reference.

    Dong Yu, Jinyu Li, and Li Deng, "Calibration of Confidence Measures in Speech Recognition," https://www.microsoft.com/en-us/research/wp-content/uploads/2011/01/ConfidenceCalibration.pdf
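
    As a rough illustration (not the paper's exact method), here is a minimal Platt-style calibration sketch in Python. It assumes you already have raw confidence scores from a recognizer and binary labels marking whether each hypothesis was actually correct (e.g. obtained by scoring hypotheses against reference transcripts); the numbers below are made up for illustration.

    ```python
    # Minimal sketch of Platt-style confidence calibration.
    # raw_conf and is_correct are hypothetical example data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Raw confidences from the recognizer (one per hypothesis).
    raw_conf = np.array([0.95, 0.80, 0.75, 0.60, 0.40, 0.30]).reshape(-1, 1)
    # 1/0 labels: was the hypothesis actually correct?
    is_correct = np.array([1, 1, 0, 1, 0, 0])

    # Fit a logistic mapping from raw score to probability of being correct.
    calibrator = LogisticRegression()
    calibrator.fit(raw_conf, is_correct)

    # Calibrated confidence: now interpretable (approximately) as
    # P(correct | raw score) on data resembling the training set.
    new_scores = np.array([0.85, 0.50]).reshape(-1, 1)
    calibrated = calibrator.predict_proba(new_scores)[:, 1]
    print(calibrated)
    ```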

  • Is there a recognized meaning of 'confidence score' in the speech recognition community?

    Not really; everyone uses their own algorithm, from simple Bayes risk (which is not the best estimate at all) to much more advanced methods. It is not really possible to know what Google does, though you can check empirically how well calibrated a provider's scores are, as sketched below. Kaldi also has an implementation of a good calibration algorithm: https://github.com/kaldi-asr/kaldi/blob/master/egs/ami/s5/local/confidence_calibration.sh
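
    One rough way to check calibration yourself, under the assumption that you have confidence scores plus correctness labels for a test set: bin the scores and compare the mean confidence in each bin to the empirical accuracy. If the scores are well calibrated, the two should be close. The data below is hypothetical.

    ```python
    # Rough calibration check: bin confidences and compare mean confidence
    # to empirical accuracy in each bin. conf and correct are example data.
    import numpy as np

    def calibration_table(conf, correct, n_bins=5):
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        rows = []
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (conf >= lo) & (conf < hi)
            if mask.any():
                rows.append((lo, hi, conf[mask].mean(),
                             correct[mask].mean(), int(mask.sum())))
        # Each row: (bin_lo, bin_hi, mean_confidence, empirical_accuracy, count)
        return rows

    conf = np.array([0.92, 0.88, 0.70, 0.65, 0.55, 0.30, 0.25])
    correct = np.array([1, 1, 1, 0, 1, 0, 0])
    for lo, hi, mean_conf, acc, n in calibration_table(conf, correct):
        print(f"[{lo:.1f}, {hi:.1f}): mean conf {mean_conf:.2f}, "
              f"accuracy {acc:.2f}, n={n}")
    ```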