android speech-recognition text-to-speech

TTS onDone callback never fires on Samsung (Android 15) post-SpeechRecognizer, even with AUDIOFOCUS_REQUEST_GRANTED


I'm facing a very specific, reproducible bug and I've hit a wall after trying all the standard solutions. I would appreciate any insight.

I am developing a voice assistant setup flow where the app uses SpeechRecognizer (STT) to get the user's name, followed by TextToSpeech (TTS) to confirm the settings.

The entire flow gets stuck because the TextToSpeech engine hangs immediately after the SpeechRecognizer session concludes. This issue has been reliably reproduced on a specific test device.

Environment

Technical Sequence of Events

Here is the exact flow leading to the failure, confirmed by logcat analysis:

  1. A SpeechRecognizer instance is created in a foreground service (VoiceSessionService) and starts listening.

  2. The user speaks, and the onResults callback is successfully triggered with the transcribed text.

  3. Immediately after receiving the result, the SpeechRecognizer instance is fully terminated via stopListening() and destroy(), and the service stops itself.

  4. The main thread waits for a 500ms delay using Handler.postDelayed.

  5. After the delay, it successfully requests and is granted audio focus (AUDIOFOCUS_REQUEST_GRANTED) using AudioFocusRequest.Builder.

  6. A call is made to TextToSpeech.speak() to utter the first confirmation phrase.

  7. FAILURE POINT: The TextToSpeech engine accepts the command (the call to .speak() returns without error) but then goes silent. It never plays the audio and, most critically, it never triggers the onDone or onError callbacks in its UtteranceProgressListener.

The application logic is now permanently blocked, waiting for an onDone callback that will never arrive.
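
Whatever the root cause, the flow does not have to block forever on a callback that may never arrive. Below is a minimal sketch of a "run once" guard that lets either the TTS completion or a timeout continue the flow, whichever comes first. `RunOnce` is a hypothetical helper that is not part of the project above, and the 5-second timeout in the commented wiring is an arbitrary assumption.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

/**
 * Hypothetical helper (not in the original project): wraps a continuation so that
 * it runs at most once, whichever of "TTS finished" or "timeout" happens first.
 */
class RunOnce(private val action: (timedOut: Boolean) -> Unit) {
    private val fired = AtomicBoolean(false)

    /** Returns true if this call ran the action, false if it had already run. */
    fun fire(timedOut: Boolean): Boolean {
        if (!fired.compareAndSet(false, true)) return false
        action(timedOut)
        return true
    }
}

// Possible wiring inside onUserInput (sketch only; the 5-second value is arbitrary):
//   val next = RunOnce { timedOut ->
//       if (timedOut) Log.w("MY_APP_LOG", "No TTS callback received; continuing anyway")
//       speakFinalSettings(appContext)
//   }
//   Handler(Looper.getMainLooper()).postDelayed({ next.fire(timedOut = true) }, 5_000)
//   TtsManager.speak(appContext, confirmationPhrase, onDone = { next.fire(timedOut = false) })
```

This does not fix the missing callback itself. It only keeps the setup flow from hanging and leaves a log line that shows the timeout path was taken.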

Relevant Code & Logs

ConversationManager.kt - Logic for handling the STT result:

// This method is called after the STT result is received
fun onUserInput(text: String) {
    if (awaitingFirstRunName) {
        awaitingFirstRunName = false
        val nameToSave = if (text.isNotBlank()) text.trim() else "Default Name"
        settings.saveUserName(nameToSave)
        
        Handler(Looper.getMainLooper()).postDelayed({
            val audioManager = appContext.getSystemService(Context.AUDIO_SERVICE) as AudioManager
            val audioAttributes = AudioAttributes.Builder()
                .setUsage(AudioAttributes.USAGE_ASSISTANCE_ACCESSIBILITY)
                .setContentType(AudioAttributes.CONTENT_TYPE_SPEECH)
                .build()
            val focusRequest = AudioFocusRequest.Builder(AudioManager.AUDIOFOCUS_GAIN_TRANSIENT_MAY_DUCK)
                .setAudioAttributes(audioAttributes)
                .build()

            val result = audioManager.requestAudioFocus(focusRequest)

            if (result == AudioManager.AUDIOFOCUS_REQUEST_GRANTED) {
                Log.d("MY_APP_LOG", "Audio focus granted. Forcing TTS engine restart.")
                
                // Forcing a restart of the TTS Engine
                TtsManager.shutdown()
                TtsManager.initialize(appContext)
                
                val confirmationPhrase = "OK, $nameToSave"
                TtsManager.speak(
                    context = appContext,
                    text = confirmationPhrase,
                    queueMode = TextToSpeech.QUEUE_ADD,
                    onDone = {
                        // This callback is never triggered
                        Log.d("MY_APP_LOG", "Confirmation phrase DONE. This is never logged.")
                        speakFinalSettings(appContext)
                    }
                )
            } else {
                Log.e("MY_APP_LOG", "Could not get audio focus.")
            }
        }, 500)
        return
    }
    // ... logic for other commands follows
}

VoiceSessionService.kt - Ensuring SpeechRecognizer is destroyed:

private fun processCommand(text: String) {
    val intent = Intent(CommandReceiver.ACTION_PROCESS_COMMAND).apply {
        putExtra(CommandReceiver.EXTRA_RECO_TEXT, text)
    }
    sendBroadcast(intent)
    
    // Immediately destroy the recognizer to release all audio resources
    speechRecognizer?.stopListening()
    speechRecognizer?.destroy()
    speechRecognizer = null
    
    stopSelf()
}

Logcat Snippet:

D/RescueService: onResults: Andrei
D/RescueService: First run: User name saved as 'Andrei'
// --- 500ms delay occurs here ---
D/RescueService: Audio focus granted for the first time setup confirmation.
D/RescueService: Forcing TTS engine restart.
D/RescueService: TTS speaking: 'OK, Andrei' (queue=1) id=...
// --- SILENCE ---
// --- No "Confirmation phrase DONE" log ever appears ---

UPDATE: Helper Class Definitions as Requested

Here is the source code for the helper classes used in the snippets above.

TtsManager.kt

package com.babenko.rescueservice.voice

import android.content.Context
import android.os.Handler
import android.os.Looper
import android.speech.tts.TextToSpeech
import android.speech.tts.UtteranceProgressListener
import com.babenko.rescueservice.core.Logger
import java.util.*

object TtsManager : TextToSpeech.OnInitListener {
    private var tts: TextToSpeech? = null
    private var isInitialized = false
    private var appContext: Context? = null
    private val handler = Handler(Looper.getMainLooper())
    private val pending = ArrayDeque<Triple<String, Int, (() -> Unit)?>>()
    private val utteranceCallbacks = mutableMapOf<String, () -> Unit>()

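    fun initialize(context: Context) {
        // Return if an engine instance already exists, even if onInit has not fired yet.
        // Otherwise initialize() followed by speak() creates a second TextToSpeech
        // instance, and queued text may be sent to an engine that is not bound yet.
        if (tts != null) return
        appContext = context.applicationContext
        try {
            // Binding explicitly to Google TTS Engine
            tts = TextToSpeech(context.applicationContext, this, "com.google.android.tts")
        } catch (e: Exception) {
            Logger.e(e, "Failed to create TextToSpeech")
        }
    }

    override fun onInit(status: Int) {
        if (status != TextToSpeech.SUCCESS) {
            isInitialized = false
            pending.clear()
            // Drop the failed instance so a later initialize() can retry. Posted because
            // onInit may run before the constructor's result has been assigned to `tts`.
            handler.post {
                tts?.shutdown()
                tts = null
            }
            return
        }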
        isInitialized = true
        tts?.setOnUtteranceProgressListener(object : UtteranceProgressListener() {
            override fun onStart(utteranceId: String?) {}
            override fun onDone(utteranceId: String?) {
                utteranceId?.let {
                    handler.post {
                        utteranceCallbacks[it]?.invoke()
                        utteranceCallbacks.remove(it)
                    }
                }
            }
            @Deprecated("Deprecated in Java")
            override fun onError(utteranceId: String?) {
                utteranceId?.let { utteranceCallbacks.remove(it) }
            }
        })
        
        // Process any pending speech requests
        while (pending.isNotEmpty()) {
            val (text, mode, onDone) = pending.removeFirst()
            internalSpeak(text, mode, onDone)
        }
    }
    
    fun speak(context: Context, text: String, queueMode: Int = TextToSpeech.QUEUE_ADD, onDone: (() -> Unit)? = null) {
        if (!isInitialized) {
            // Engine not ready yet: queue the request; onInit() drains the queue
            pending.addLast(Triple(text, queueMode, onDone))
            initialize(context)
            return
        }
        internalSpeak(text, queueMode, onDone)
    }

    private fun internalSpeak(text: String, queueMode: Int, onDone: (() -> Unit)? = null) {
        val utteranceId = text.hashCode().toString() + System.currentTimeMillis()
        onDone?.let { utteranceCallbacks[utteranceId] = it }
        tts?.speak(text, queueMode, null, utteranceId)
    }
    
    fun shutdown() {
        tts?.stop()
        tts?.shutdown()
        tts = null
        isInitialized = false
        pending.clear()
        utteranceCallbacks.clear()
    }
}
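
Two details in the TtsManager above can hide a failure: the return value of `tts?.speak(...)` is ignored, and `onError` removes the callback without logging anything, so an engine error looks the same to the app as a callback that never fires. The sketch below adds logging for both. `LoggingUtteranceListener` and `speakChecked` are hypothetical names, and the tag is reused from the snippets above. The framework calls used are `TextToSpeech.speak` (which returns `SUCCESS` or `ERROR` for queuing) and the `onError(String, int)` and `onStop(String, boolean)` overloads of `UtteranceProgressListener`.

```kotlin
import android.speech.tts.TextToSpeech
import android.speech.tts.UtteranceProgressListener
import android.util.Log

private const val TAG = "MY_APP_LOG"

/** Hypothetical listener that logs every callback and reports success or failure. */
class LoggingUtteranceListener(
    private val onFinished: (utteranceId: String, success: Boolean) -> Unit
) : UtteranceProgressListener() {

    override fun onStart(utteranceId: String?) {
        Log.d(TAG, "TTS onStart id=$utteranceId")
    }

    override fun onDone(utteranceId: String?) {
        Log.d(TAG, "TTS onDone id=$utteranceId")
        utteranceId?.let { onFinished(it, true) }
    }

    @Deprecated("Deprecated in Java")
    override fun onError(utteranceId: String?) {
        Log.e(TAG, "TTS onError id=$utteranceId")
        utteranceId?.let { onFinished(it, false) }
    }

    // API 21+: replaces the default implementation, which forwards to the overload above.
    override fun onError(utteranceId: String?, errorCode: Int) {
        Log.e(TAG, "TTS onError id=$utteranceId code=$errorCode")
        utteranceId?.let { onFinished(it, false) }
    }

    // API 23+: called when the utterance is stopped or flushed from the queue.
    override fun onStop(utteranceId: String?, interrupted: Boolean) {
        Log.w(TAG, "TTS onStop id=$utteranceId interrupted=$interrupted")
        utteranceId?.let { onFinished(it, false) }
    }
}

/** Hypothetical wrapper: logs when the engine refuses to queue the utterance. */
fun speakChecked(tts: TextToSpeech, text: String, queueMode: Int, utteranceId: String): Boolean {
    val result = tts.speak(text, queueMode, null, utteranceId)
    if (result != TextToSpeech.SUCCESS) {
        Log.e(TAG, "TTS speak() was not queued, result=$result id=$utteranceId")
    }
    return result == TextToSpeech.SUCCESS
}
```

The listener callbacks are not guaranteed to run on the main thread, so `onFinished` should post to a main-thread `Handler`, as the original `onDone` does, before touching app state.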

ConversationManager.kt

package com.babenko.rescueservice.voice

import android.content.Context
import android.media.AudioAttributes
import android.media.AudioFocusRequest
import android.media.AudioManager
import android.os.Handler
import android.os.Looper
import android.speech.tts.TextToSpeech
import android.util.Log
import com.babenko.rescueservice.data.SettingsManager // Assuming this is your settings helper

object ConversationManager {
    private lateinit var appContext: Context
    private var awaitingFirstRunName = false
    private val settings: SettingsManager by lazy { SettingsManager.getInstance(appContext) }

    fun init(context: Context) {
        appContext = context.applicationContext
    }

    fun startFirstRunSetup() {
        // ... (Code to speak the welcome message and start STT)
    }

    fun onUserInput(text: String) {
        if (awaitingFirstRunName) {
            awaitingFirstRunName = false
            val nameToSave = if (text.isNotBlank()) text.trim() else "Default Name"
            settings.saveUserName(nameToSave)
            
            Handler(Looper.getMainLooper()).postDelayed({
                val audioManager = appContext.getSystemService(Context.AUDIO_SERVICE) as AudioManager
                val audioAttributes = AudioAttributes.Builder()
                    .setUsage(AudioAttributes.USAGE_ASSISTANCE_ACCESSIBILITY)
                    .setContentType(AudioAttributes.CONTENT_TYPE_SPEECH)
                    .build()
                val focusRequest = AudioFocusRequest.Builder(AudioManager.AUDIOFOCUS_GAIN_TRANSIENT_MAY_DUCK)
                    .setAudioAttributes(audioAttributes)
                    .build()

                val result = audioManager.requestAudioFocus(focusRequest)

                if (result == AudioManager.AUDIOFOCUS_REQUEST_GRANTED) {
                    Log.d("MY_APP_LOG", "Audio focus granted. Forcing TTS engine restart.")
                    
                    // Forcing a restart of the TTS Engine
                    TtsManager.shutdown()
                    TtsManager.initialize(appContext)
                    
                    val confirmationPhrase = "OK, $nameToSave"
                    TtsManager.speak(
                        context = appContext,
                        text = confirmationPhrase,
                        queueMode = TextToSpeech.QUEUE_ADD,
                        onDone = {
                            Log.d("MY_APP_LOG", "Confirmation phrase DONE. This is never logged.")
                            speakFinalSettings(appContext)
                        }
                    )
                } else {
                    Log.e("MY_APP_LOG", "Could not get audio focus.")
                }
            }, 500)
            return
        }
    }

    fun speakFinalSettings(context: Context) {
        val finalName = settings.getUserName()
        // ... (builds final string and calls TtsManager.speak)
        val confirmationMessage = "Settings confirmed for $finalName."
        TtsManager.speak(context, confirmationMessage, TextToSpeech.QUEUE_ADD)
    }
}

I have already implemented a series of fixes based on best practices to resolve audio resource conflicts between STT and TTS.

What I Tried:

  1. Explicitly bound to the Google TTS Engine (com.google.android.tts).

    • I was expecting this to prevent instability caused by vendor-specific (Samsung) TTS engines. The binding was successful, but the hang still occurs.
  2. Ensured immediate and full destruction of SpeechRecognizer using .destroy() right after getting a result.

    • I was expecting this to immediately release the microphone and audio channel, making them available for the TTS engine.
  3. Introduced a postDelayed barrier of 500ms between destroying SpeechRecognizer and attempting to use TextToSpeech.

    • I was expecting this to give the Android audio system enough time to process the resource release and avoid a race condition.
  4. Upgraded the audio focus request to the modern AudioFocusRequest.Builder method, specifying AudioAttributes for an accessibility assistant.

    • I was expecting that a more descriptive request would be prioritized and granted by the system. The actual result is that the request is successful (logcat confirms AUDIOFOCUS_REQUEST_GRANTED), but the TTS engine still hangs.
  5. Forced a full restart of the TTS engine by calling TtsManager.shutdown() and then TtsManager.initialize() immediately before speaking.

    • I was expecting this to clear any corrupt or "hung" state within the system's TTS service. This was my final attempt, but the problem persists even with this measure.

What I was expecting overall:

I expected that by correctly managing the SpeechRecognizer lifecycle, adding a delay, properly requesting and receiving audio focus, and even force-restarting the TTS engine, the TextToSpeech engine would function normally and play the audio.

Actual Result:

Despite all these measures and successfully acquiring audio focus, the TextToSpeech engine still enters a non-functional state. It accepts the .speak() command but never executes it or provides a completion callback (onDone).

This leads to my final question:

Why does the TextToSpeech engine hang in this manner after a SpeechRecognizer session, even when all documented best practices are followed and audio focus is successfully granted?


Solution

  • Fixed. The root cause was not the TTS engine itself: the implicit broadcast sent from the foreground service, which runs in a separate process, was blocked on Android 14/15. We made the broadcast explicit and sent it immediately before stopping the service, which restored reliable delivery and the final voice confirmation.

    The existing delay, audio-focus handling, and SpeechRecognizer → TTS shutdown order remain in place, and the full voice flow is now stable.
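
    For readers with the same symptom, here is a minimal sketch of what making the broadcast explicit could look like. It is not the exact change made in the project. `sendExplicitCommand` is a hypothetical helper, and it assumes the receiver (CommandReceiver in the question) is declared in the manifest. For a receiver registered at runtime with registerReceiver, restricting the intent with `setPackage(context.packageName)` would be the alternative.

```kotlin
import android.content.BroadcastReceiver
import android.content.Context
import android.content.Intent

/**
 * Hypothetical helper: sends the recognized text to one specific receiver class
 * instead of relying on implicit (action-only) matching.
 */
fun sendExplicitCommand(
    context: Context,
    receiverClass: Class<out BroadcastReceiver>,
    action: String,
    extraKey: String,
    text: String
) {
    val intent = Intent(action).apply {
        // Naming the target component makes the intent explicit.
        setClass(context, receiverClass)
        putExtra(extraKey, text)
    }
    context.sendBroadcast(intent)
}
```

    In processCommand this would replace the `Intent(CommandReceiver.ACTION_PROCESS_COMMAND)` and `sendBroadcast(intent)` pair, called as `sendExplicitCommand(this, CommandReceiver::class.java, CommandReceiver.ACTION_PROCESS_COMMAND, CommandReceiver.EXTRA_RECO_TEXT, text)` before `stopSelf()`.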