I’m trying to use the ChatGPT Realtime API to recognize spoken commands and to have natural-language conversations.
What I need:
User speaks a command (e.g. “Turn on the living room lights”).
Note that the user can also say things that are not commands; in that case the API should obviously not recognize any command and simply answer in its own way.
I want the API to return two things at once:
A structured text output (e.g. JSON) so my program can parse the command (with a "command id" if there is a command, otherwise "command id" should be empty).
A natural speech audio output (so the user hears something like “OK, turning on the lights”), but not the raw JSON read out loud.
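For example, for the command above I’d expect the text output to look roughly like this (the exact field names are just illustrative, not something I’m tied to):
{
  "command_id": "turn_on_lights",
  "intent": "turn_on",
  "device": "lights",
  "location": "living room"
}
and the audio output to be only a short spoken confirmation.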
The problem: If I enable the audio output modality, the model will also try to “speak” whatever it outputs as text. But in my case, I want the text to be machine-readable (structured JSON), and the audio to be user-friendly speech.
Here’s the actual code that I need help with:
// Node 18+
// npm i ws speaker
import WebSocket from 'ws';
import fs from 'fs';
import Speaker from 'speaker';
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
if (!OPENAI_API_KEY) {
console.error('Set OPENAI_API_KEY'); process.exit(1);
}
const REALTIME_URL = 'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview';
const SAMPLE_RATE = 24000;
const MODE = process.env.MODE || 'BAD'; // BAD | GOOD
const PCM_FILE = process.env.PCM_FILE || 'sample.pcm';
// Basic audio out (PCM16 mono 24kHz)
const speaker = new Speaker({ channels: 1, bitDepth: 16, sampleRate: SAMPLE_RATE, signed: true });
function playPcmBase64Chunk(b64) {
const buf = Buffer.from(b64, 'base64');
speaker.write(buf);
}
const ws = new WebSocket(REALTIME_URL, {
headers: {
Authorization: `Bearer ${OPENAI_API_KEY}`,
'OpenAI-Beta': 'realtime=v1'
}
});
// Buffers to capture outputs for debugging
let textBuf = '';
let audioStarted = false;
ws.on('open', async () => {
console.log('[WS] connected, setting up session…');
// Session with audio in/out PCM16. (No tool calling; only text+audio.)
ws.send(JSON.stringify({
type: 'session.update',
session: {
modalities: ['text', 'audio'], // session supports both, but we’ll control per-response
input_audio_format: { type: 'pcm16', sample_rate: SAMPLE_RATE },
output_audio_format: { type: 'pcm16', sample_rate: SAMPLE_RATE }
}
}));
// Feed a short PCM16 sample (1–2 seconds)
const pcm = fs.readFileSync(PCM_FILE);
const CHUNK = 8000; // arbitrary small chunks
for (let i = 0; i < pcm.length; i += CHUNK) {
const slice = pcm.subarray(i, i + CHUNK);
ws.send(JSON.stringify({
type: 'input_audio_buffer.append',
audio: slice.toString('base64')
}));
}
// finalize audio input
ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
if (MODE === 'BAD') {
console.log('\nMODE=BAD → one response with ["text","audio"]');
ws.send(JSON.stringify({
type: 'response.create',
response: {
// both outputs in ONE response:
modalities: ['text', 'audio'],
// ask for JSON in text and a natural confirmation in audio
// but sometimes the audio ends up speaking JSON :(
instructions:
'Return ONLY JSON in the TEXT modality (machine-readable command). ' +
'Then produce a short, natural confirmation in the AUDIO modality. ' +
'Do NOT read or mention any JSON in speech.'
}
}));
} else {
console.log('\nMODE=GOOD → two responses: [text-only] then [audio-only]');
// 1) TEXT-ONLY: get JSON (works reliably)
ws.send(JSON.stringify({
type: 'response.create',
response: {
modalities: ['text'],
instructions:
'Return ONLY JSON. No extra words. JSON must represent the user command (intent, device, location, value).'
}
}));
// 2) AUDIO-ONLY: speak a natural confirmation (no JSON spoken)
// Small delay to avoid races; ideally wait for response.done of the previous (see the sketch after this code).
setTimeout(() => {
ws.send(JSON.stringify({
type: 'response.create',
response: {
modalities: ['audio'],
instructions:
'Say a short, friendly confirmation. Do NOT read or mention any JSON.'
}
}));
}, 600);
}
});
// Handle server events (text + audio)
ws.on('message', (data) => {
const msg = JSON.parse(data);
// Text streaming
if (msg.type === 'response.output_text.delta' && msg.delta) {
textBuf += msg.delta;
}
if (msg.type === 'response.output_text.done') {
console.log('\n[TEXT DONE]\n' + textBuf);
}
// Audio streaming
if (msg.type === 'response.output_audio.delta' && msg.delta) {
if (!audioStarted) { audioStarted = true; console.log('\n[AUDIO streaming…]'); }
playPcmBase64Chunk(msg.delta);
}
if (msg.type === 'response.output_audio.done') {
console.log('\n[AUDIO DONE]');
try { speaker.end(); } catch {}
}
// For debugging
if (msg.type === 'error') {
console.error('[API ERROR]', msg);
}
});
process.on('SIGINT', () => { try { speaker.end(); } catch {} try { ws.close(); } catch {} process.exit(0); });
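(Aside: the setTimeout in GOOD mode is only there to avoid a race between the two responses. A cleaner variant, I assume, would be to wait for the previous response’s response.done event instead, roughly like this untested sketch:)
// Untested sketch: resolves once the server reports the current response is done,
// so the audio-only response.create can be sent without a fixed delay.
function waitForResponseDone() {
  return new Promise((resolve) => {
    const onMessage = (data) => {
      const msg = JSON.parse(data);
      if (msg.type === 'response.done') {
        ws.off('message', onMessage);
        resolve(msg);
      }
    };
    ws.on('message', onMessage);
  });
}
// GOOD mode would then be:
//   send text-only response.create → await waitForResponseDone() → send audio-only response.create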
What I expect
In BAD mode (single response with ["text","audio"]):
TEXT: strict JSON (machine-readable).
AUDIO: natural confirmation (no JSON spoken).
In GOOD mode (two responses):
Works as expected (text-only JSON, then audio-only confirmation).
What actually happens
In BAD mode, the audio sometimes speaks the JSON (or parts of it), even though instructions say not to.
In GOOD mode, splitting into two responses avoids the issue but costs extra input tokens.
Questions
Is there a reliable way to keep a single response.create with ["text","audio"] where the audio never speaks the JSON?
Is there a recommended prompt pattern or API flag to mark certain textual content as non-verbal for audio synthesis?
Any working example or best practice would be appreciated.
Newer OpenAI realtime models, such as gpt-realtime and gpt-realtime-mini, support function calling (also known as tool calling), which is better suited for letting the AI perform actions in your program.
In the session.update command, add tool definitions for all actions you want the AI to be able to perform:
ws.send(JSON.stringify({
type: 'session.update',
session: {
modalities: ['text', 'audio'], // session supports both, but we’ll control per-response
input_audio_format: { type: 'pcm16', sample_rate: SAMPLE_RATE },
output_audio_format: { type: 'pcm16', sample_rate: SAMPLE_RATE },
// define your tools (functions) here:
tools: [
{
type: "function",
name: "turn_on",
description: "Turn an appliance on.",
parameters: {
type: "object",
properties: {
// properties are like arguments in programming languages
appliance: {
type: "string",
description: "Appliance to turn on"
}
},
// here you can set the required properties
required: ["appliance"],
additionalProperties: false
}
}
],
// IMPORTANT - "auto" lets the model decide when to call the tools
tool_choice: "auto",
// also add system instructions that mention the tools
instructions: 'You are a smart home voice assistant. Use the turn_on tool if the user requests to turn an appliance on, passing the name of the appliance (e.g. "living room lights") in the "appliance" argument.'
}
}));
Then, in the part that handles server events, check for response.function_call_arguments.done. That event is emitted once the model has finished generating the arguments for a function call. After handling it, you need to send the function's output back to the server.
// Handle server events (text + audio)
ws.on('message', (data) => {
  const msg = JSON.parse(data);
  // checks for other events go here
if (msg.type === 'response.function_call_arguments.done') {
console.log('\n[TOOL CALLED]');
// msg.name is the name of the function, msg.arguments is a JSON string with the arguments
const args = JSON.parse(msg.arguments);
if (msg.name === 'turn_on') {
// turn on device here
turn_on(args.appliance);
// now create a history item with the function's output (this will be sent to the AI)
ws.send(JSON.stringify({
type: 'conversation.item.create',
item: {
type: "function_call_output",
call_id: msg.call_id,
// replace this with your actual status or any other information you want to send (e.g. light level...)
output: "Turned on appliance"
}
}));
}
}
});
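One more thing worth noting (based on my understanding of the Realtime API): creating the function_call_output item by itself does not make the model answer; you still have to request a new response so the user hears a natural spoken confirmation. The structured data then lives entirely in the tool call's arguments, so no JSON is ever handed to speech synthesis. A minimal sketch, placed right after the conversation.item.create call above:
// Ask the model to respond now that the tool result is in the conversation;
// with audio enabled in the session, this produces the spoken confirmation.
ws.send(JSON.stringify({
  type: 'response.create',
  response: {
    instructions: 'Briefly confirm to the user that the requested appliance has been turned on.'
  }
}));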