[SOLVED] Do some LLMs understand voice directly, or do they have to go through a text transcription stage?

I want to interact with an LLM via voice. To select the right model, I'd like to know whether there are LLMs that understand voice directly. If not, I'll have to transcribe the user's voice into text and convert the model's text response back into audio.

Thanks for your help.

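LLMs in general are text-to-text models, but there are "multi-modal" models (like GPT-4 in ChatGPT, Gemini 1.5 Pro and others) which can accept other kinds of input (such as images, audio or video). Which modalities are supported varies by model and version, so check the documentation of the one you pick. For your use case, you can either use one of these models with the audio directly, or use a speech-to-text model (like Whisper) as a preprocessing step before passing the text to a text-based model. In the second case, if you want a spoken reply, you also need a text-to-speech step on the model's output.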
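
To make the second option concrete, here is a minimal sketch of one voice turn: speech-to-text, then a text model, then text-to-speech. The names voice_turn, whisper_transcriber and the three callables are hypothetical and only for illustration; the chat and synthesize steps are left as plug-in functions because they depend on whichever model and TTS service you choose. The Whisper helper assumes the open-source openai-whisper package is installed and is only imported when you call it.

```python
from typing import Callable, Tuple

# Hypothetical stage signatures (not from any particular library):
Transcriber = Callable[[str], str]    # audio file path -> transcript text
ChatModel = Callable[[str], str]      # prompt text -> reply text
Synthesizer = Callable[[str], bytes]  # reply text -> audio bytes


def voice_turn(
    audio_path: str,
    transcribe: Transcriber,
    chat: ChatModel,
    synthesize: Synthesizer,
) -> Tuple[str, str, bytes]:
    """Run one voice interaction through a text-only model."""
    transcript = transcribe(audio_path)  # 1. speech -> text
    reply = chat(transcript)             # 2. text -> text (the LLM)
    audio = synthesize(reply)            # 3. text -> speech
    return transcript, reply, audio


def whisper_transcriber(model_name: str = "base") -> Transcriber:
    """Build a transcriber backed by Whisper.

    Assumption: the openai-whisper package is installed (pip install openai-whisper).
    The import is done lazily so the rest of the sketch runs without it.
    """
    import whisper

    model = whisper.load_model(model_name)

    def transcribe(audio_path: str) -> str:
        # transcribe() returns a dict; the full transcript is under "text".
        return model.transcribe(audio_path)["text"].strip()

    return transcribe


if __name__ == "__main__":
    # Stub stages so the flow can be tried without any model installed.
    transcript, reply, audio = voice_turn(
        "question.wav",
        transcribe=lambda path: "hello",
        chat=lambda text: text.upper(),
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(transcript, reply, len(audio))
```

To use real components, pass whisper_transcriber() as the transcribe argument and wrap your chosen LLM and text-to-speech calls in functions matching the signatures above. With a model that accepts audio directly, steps 1 and 2 collapse into a single call.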