Tags: ios, swift, apple-watch, watchconnectivity, sfspeechrecognizer

Streaming audio from Watch to iPhone to use SFSpeechRecognizer


I want to do speech recognition in my Watch app, displaying a live transcription. Since SFSpeechRecognizer isn't available on watchOS, I set the app up to stream audio to the iOS companion using WatchConnectivity. Before attempting this, I tried the same code on the iPhone alone, without involving the Watch - it works there.

With my streaming attempt, the companion receives audio chunks and doesn't throw any errors, but it doesn't transcribe any text either. I suspect I did something wrong when converting to AVAudioPCMBuffer and back, but I can't quite put my finger on it, as I lack experience working with raw data and pointers.

Now, the whole thing works as follows:

  1. User presses a button, triggering the Watch to ask the iPhone to set up a recognitionTask:
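On the Watch this is a one-liner in the button's action (a sketch; .startSpeechRecognition stands in for the actual case of my message enum):
WCManager.shared.sendWatchMessage(.startSpeechRecognition)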
  2. iPhone sets up the recognitionTask and answers with .ok or an error:
guard let speechRecognizer = self.speechRecognizer else {
    WCManager.shared.sendWatchMessage(.speechRecognitionRequest(.error("no speech recognizer")))
    return
}
recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
guard let recognitionRequest = recognitionRequest else {
    WCManager.shared.sendWatchMessage(.speechRecognitionRequest(.error("speech recognition request denied by ios")))
    return
}
recognitionRequest.shouldReportPartialResults = true
if #available(iOS 13, *) {
    // Keep recognition on the device; no server round trip needed.
    recognitionRequest.requiresOnDeviceRecognition = true
}

recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
    if let result = result {
        let t = result.bestTranscription.formattedString
        WCManager.shared.sendWatchMessage(.recognizedSpeech(t))
    }
    
    if error != nil {
        self.recognitionRequest = nil
        self.recognitionTask = nil
        WCManager.shared.sendWatchMessage(.speechRecognition(.error("?")))
    }
}
WCManager.shared.sendWatchMessage(.speechRecognitionRequest(.ok))
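WCManager is my thin wrapper around WCSession. It shouldn't matter for this question, but for context, here is a minimal sketch of the send side (WatchMessage stands in for my real enum, and I assume the session was already activated during startup):
import Foundation
import WatchConnectivity

// Stand-in for the real message enum, which carries the cases used
// throughout this question (.speechRecognitionRequest, .speechRecognition, ...).
enum WatchMessage: Codable {
    case recognizedSpeech(String)
}

final class WCManager {
    static let shared = WCManager()

    // Encodes a message and sends it to the counterpart device.
    // Assumes WCSession.default was activated elsewhere (delegate omitted).
    func sendWatchMessage(_ message: WatchMessage,
                          errorHandler: ((Error) -> Void)? = nil) {
        guard WCSession.default.activationState == .activated else {
            errorHandler?(WCError(.sessionNotActivated))
            return
        }
        do {
            let data = try JSONEncoder().encode(message)
            WCSession.default.sendMessageData(data, replyHandler: nil, errorHandler: errorHandler)
        } catch {
            errorHandler?(error)
        }
    }
}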
  3. Watch sets up an audio session, installs a tap on the audio engine's input node, and sends the audio format to the iPhone:
do {
    try startAudioSession()
} catch {
    self.state = .error("couldn't start audio session")
    return
}

let inputNode = audioEngine.inputNode
let recordingFormat = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
        let audioBuffer = buffer.audioBufferList.pointee.mBuffers
        let data = Data(bytes: audioBuffer.mData!, count: Int(audioBuffer.mDataByteSize))
        if self.state == .running {
            WCManager.shared.sendWatchMessage(.speechRecognition(.chunk(data, frameCount: Int(buffer.frameLength))))
        }
    }
audioEngine.prepare()

do {
    let data = try NSKeyedArchiver.archivedData(withRootObject: recordingFormat, requiringSecureCoding: true)
    WCManager.shared.sendWatchMessage(.speechRecognition(.audioFormat(data)),
        errorHandler: { _ in
            self.state = .error("iphone unavailable")
    })
    self.state = .sentAudioFormat
} catch {
    self.state = .error("could not convert audio format")
}
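startAudioSession() is just the usual AVAudioSession setup, roughly like this (a sketch; the exact category/mode choices may differ):
func startAudioSession() throws {
    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(.record, mode: .default, options: [])
    try audioSession.setActive(true)
}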
  4. iPhone saves the audio format and returns .ok or .error():
guard let format = try? NSKeyedUnarchiver.unarchivedObject(ofClass: AVAudioFormat.self, from: data) else {
    // ...send back .error, destroy the recognitionTask
    return
}
self.audioFormat = format
// ...send back .ok
  5. Watch starts the audio engine:
try audioEngine.start()
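In context, with the surrounding state bookkeeping (the .running state is what the tap above checks before forwarding chunks):
do {
    try audioEngine.start()
    self.state = .running
} catch {
    self.state = .error("couldn't start audio engine")
}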
  6. iPhone receives audio chunks and appends them to the recognitionRequest:
guard let pcm = AVAudioPCMBuffer(pcmFormat: audioFormat, frameCapacity: AVAudioFrameCount(frameCount)) else {
    // ...send back .error, destroy the recognitionTask
    return
}

let channels = UnsafeBufferPointer(start: pcm.floatChannelData, count: Int(pcm.format.channelCount))
let data = chunk as NSData
data.getBytes(UnsafeMutableRawPointer(channels[0]), length: data.length)
recognitionRequest.append(pcm)

Any ideas are highly appreciated. Thanks for taking the time!


Solution

  • I forgot to update AVAudioPCMBuffer.frameLength after copying the memory. It works flawlessly now, without any noticeable delay :)

    // ...
    data.getBytes(UnsafeMutableRawPointer(channels[0]), length: data.length)
    // A freshly created AVAudioPCMBuffer has frameLength == 0, so the
    // appended buffers were effectively empty until this line:
    pcm.frameLength = AVAudioFrameCount(frameCount)
    // ...
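
For anyone copy/pasting, the complete receiving side with the fix applied (same code as in the question, plus the one missing line):
guard let pcm = AVAudioPCMBuffer(pcmFormat: audioFormat, frameCapacity: AVAudioFrameCount(frameCount)) else {
    // ...send back .error, destroy the recognitionTask
    return
}

let channels = UnsafeBufferPointer(start: pcm.floatChannelData, count: Int(pcm.format.channelCount))
let data = chunk as NSData
data.getBytes(UnsafeMutableRawPointer(channels[0]), length: data.length)
pcm.frameLength = AVAudioFrameCount(frameCount) // the missing line
recognitionRequest.append(pcm)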