
Google Speech To Text API: Get word confidences for the interim results in real time

I am using the Google Speech-to-Text API in Node.js.

I'm sending the following request:

  config: {
    encoding: 'LINEAR16',
    sampleRateHertz: 16000,
    languageCode: 'en-US',
    enableAutomaticPunctuation: true,
    metadata: {
      interactionType: 'PHONE_CALL',
      microphoneDistance: 'NEARFIELD',
      originalMediaType: 'VIDEO',
      recordingDeviceType: 'PC'
    },
    model: 'video',
    useEnhanced: true,
    enableWordConfidence: true,
    enableWordTimeOffsets: true,
    diarizationConfig: {
      enableSpeakerDiarization: true,
      minSpeakerCount: 1,
      maxSpeakerCount: 6
    }
  },
  interimResults: true,
  single_utterance: false

and when I give it a short clip from The Wolf of Wall Street, the responses I get are like this for the interim results:

  results: [
    {
      alternatives: [{
        words: [],
        transcript: 'Hey John, thank you for your vote of confidence and welcome to the',
        confidence: 0
      }],
      isFinal: false,
      stability: 0.8999999761581421,
      resultEndTime: [Object],
      channelTag: 0,
      languageCode: 'en-us'
    },
    {
      alternatives: [{ words: [], transcript: ' investor Center.', confidence: 0 }],
      isFinal: false,
      stability: 0.009999999776482582,
      resultEndTime: [Object],
      channelTag: 0,
      languageCode: 'en-us'
    }
  ],
  error: null,

and like this for the results marked as final:

  words: [
    {
      startTime: [Object],
      endTime: [Object],
      word: 'Hey',
      confidence: 0.550264298915863,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'John,',
      confidence: 0.7241439819335938,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'thank',
      confidence: 0.9128385782241821,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'you',
      confidence: 0.7003968358039856,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'for',
      confidence: 0.7170425057411194,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'your',
      confidence: 0.9128385782241821,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'vote',
      confidence: 0.7738808989524841,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'of',
      confidence: 0.7003968358039856,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'confidence',
      confidence: 0.5876403450965881,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'and',
      confidence: 0.9128385782241821,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'welcome',
      confidence: 0.9128385782241821,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'to',
      confidence: 0.7243974208831787,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'the',
      confidence: 0.657508909702301,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'investors',
      confidence: 0.6374689936637878,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'Center.',
      confidence: 0.7192383408546448,
      speakerTag: 0
    },
    {
      startTime: [Object],
      endTime: [Object],
      word: 'Bye-bye.',
      confidence: 0.6980124115943909,
      speakerTag: 0
    }
  ],
  transcript: 'Hey John, thank you for your vote of confidence and welcome to the investors Center. Bye-bye.',
  confidence: 0.7401091456413269

Is there any way to get the word confidences for the interim results? Thanks for any help or insights!


  • Unfortunately, there is no way to get word confidences on interim results. The confidence field is only populated when is_final=true, as the documentation states:

    confidence - The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result or, of a streaming result where is_final=true. This field is not guaranteed to be accurate and users should not rely on it to be always provided. The default of 0.0 is a sentinel value indicating confidence was not set.

    However, you can file a feature request for the Speech-to-Text API asking for word confidence to be included in interim results.
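
In the meantime, the only per-result signal that interim results carry is `stability`, which estimates how likely the recognizer is to change its guess. It is not a confidence score, but it can be shown while waiting for the final result. Below is a minimal sketch of a handler that treats the two cases separately. `summarizeResult` is a hypothetical helper, not part of the client library, and it assumes each result has the shape shown in the question's output; a confidence of 0 is treated as "not set", per the documentation quoted above.

```javascript
// Hypothetical helper (not part of @google-cloud/speech): normalises one
// element of `response.results` from a streaming response.
function summarizeResult(result) {
  const top = (result.alternatives && result.alternatives[0]) || {};
  const transcript = top.transcript || '';

  if (!result.isFinal) {
    // Interim result: word confidences are not populated, so `stability`
    // is the only signal available. It is NOT a confidence score.
    return {
      isFinal: false,
      transcript,
      stability: result.stability || 0,
      words: [],
    };
  }

  // Final result: per-word confidence is available. 0 is the documented
  // sentinel for "not set", so map it to null instead of reporting 0.
  const words = (top.words || []).map((w) => ({
    word: w.word,
    confidence: w.confidence > 0 ? w.confidence : null,
  }));

  return {
    isFinal: true,
    transcript,
    confidence: top.confidence > 0 ? top.confidence : null,
    words,
  };
}

// Usage with one of the interim results from the question.
const interim = summarizeResult({
  alternatives: [{ words: [], transcript: ' investor Center.', confidence: 0 }],
  isFinal: false,
  stability: 0.009999999776482582,
});
console.log(interim.transcript, interim.stability);
```

In a streaming setup this would typically be called from the stream's `data` handler, once per entry in `data.results`, updating the display with `stability` for interim entries and replacing it with the word confidences once the final entry arrives.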