On May 8, 2026, OpenAI shipped three new voice models into its API — and the most significant of them changes what voice agents can actually do.
GPT-Realtime-2 is the first voice model in the Realtime API family to carry GPT-5-class reasoning. That change unlocks a category of use cases that were previously impractical: complex multi-step voice workflows, reliable agentic tool calling during spoken interactions, and sessions long enough to handle real work. The other two models — GPT-Realtime-Translate and GPT-Realtime-Whisper — address two other gaps that have frustrated voice app developers since the original Realtime API launched. This guide covers all three, with the patterns and code you need to build production voice agents today.
What Changed: Three New Voice Models
OpenAI released these models simultaneously on May 8:
- GPT-Realtime-2 — GPT-5-class reasoning for live voice conversations, with configurable reasoning effort, a 128K context window, parallel multi-tool calling, and natural interruption handling.
- GPT-Realtime-Translate — Live speech translation from 70+ input languages into 13 output languages, matching the speaker’s pace with synthesized target-language audio.
- GPT-Realtime-Whisper — Streaming speech-to-text that generates transcript text live as the speaker talks, not batch after silence detection.
Together they cover the three most common voice pipeline architectures: conversational AI, multilingual communication, and hybrid voice-plus-text workflows where you need a live transcript alongside a spoken response.
GPT-Realtime-2: What GPT-5-Class Reasoning Means for Voice
The original Realtime API used a voice-optimized model that was fast and fluent but shallow in reasoning. It struggled with multi-step logic, complex tool chains, and tasks requiring state across more than a few exchanges. It was better at sounding natural than at being correct on hard problems.
GPT-Realtime-2 inverts that priority. The underlying reasoning engine belongs to the same model family as GPT-5.5, which topped the May 2026 MMLU Pro and GPQA Diamond benchmarks. For voice agent developers, this means four concrete improvements:
- Reliable tool chaining. The model can call five tools in sequence, evaluate each result before calling the next, and maintain task context across the full chain without confabulating intermediate state.
- Parallel tool calls. GPT-Realtime-2 can issue multiple tool calls simultaneously and merge the results. A request like “book a meeting with all three of them tomorrow afternoon” fires three calendar API calls in parallel, not in sequence.
- Audible progress signals. During tool execution the model generates spoken filler matching what it’s doing: “checking your calendar now” or “looking that up.” This removes the dead air that made earlier voice agents feel broken during operations longer than 500ms.
- Stronger instruction following. System prompts with multi-clause constraints and conditional rules are reliably respected. Earlier Realtime models drifted from complex system prompts after four or five turns.
The 128K Context Window
The previous Realtime API supported 32K tokens. That sounds large until you factor in the real cost of a voice session: every exchange — question, tool call, result, response — adds tokens to the running context. A 30-minute customer support session with moderate tool use can exceed 32K and force external state management, which adds latency and architectural complexity.
The 128K window makes 45–60 minute sessions practical without context stitching. For healthcare intake conversations, extended enterprise support workflows, and tutoring or coaching sessions, this is the change that makes the Realtime API production-viable without custom memory scaffolding. If you have been maintaining a separate summary-and-reinject loop to keep costs down, you can simplify or remove it entirely.
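A back-of-the-envelope way to see how far a window stretches, with the per-turn token cost left as an explicit input because it varies widely with tool payload sizes and should be measured from your own session logs:

// Rough estimate of how many exchanges fit in a context window.
// The per-turn figure is an assumption; replace it with measured values.
function turnsUntilLimit(contextWindow: number, tokensPerTurn: number, systemOverhead = 2_000): number {
  return Math.floor((contextWindow - systemOverhead) / tokensPerTurn)
}

// If a full exchange (user audio, tool call, result, response) averages
// roughly 900 tokens, 32K supports about 33 exchanges and 128K about 140.
turnsUntilLimit(32_000, 900)   // ~33
turnsUntilLimit(128_000, 900)  // ~140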
Configurable Reasoning Effort
GPT-Realtime-2 supports a reasoning_effort parameter with values low, medium, and high. This directly controls both latency and cost:
- low: fastest response, minimal internal chain-of-thought. Best for FAQ-style queries, simple lookups, and conversational back-and-forth with no tool calls.
- medium: balanced; the default. Handles tool use and moderate task complexity reliably.
- high: full reasoning chain before responding. Use when correctness on complex multi-step logic matters more than response speed: medical triage, financial calculations, legal reasoning.
For most deployments, routing simple turns to low and tool-heavy turns to medium cuts costs substantially without degrading quality where it matters. This is the same effort-routing principle covered in depth in the guide on managing agentic AI infrastructure costs.
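A minimal sketch of that routing, assuming reasoning_effort can be updated mid-session with session.update before the next response is generated (confirm that against the v2 session reference before relying on it):

import WebSocket from 'ws'

type Effort = 'low' | 'medium' | 'high'

// Hypothetical heuristic: short, tool-free turns go to low effort;
// anything likely to call tools stays on medium.
function chooseEffort(userText: string, toolsLikely: boolean): Effort {
  if (toolsLikely) return 'medium'
  return userText.trim().split(/\s+/).length < 12 ? 'low' : 'medium'
}

function respondWithEffort(ws: WebSocket, effort: Effort) {
  // Assumption: effort is adjusted per turn via session.update, then the
  // model is asked to respond.
  ws.send(JSON.stringify({ type: 'session.update', session: { reasoning_effort: effort } }))
  ws.send(JSON.stringify({ type: 'response.create' }))
}

The chooseEffort heuristic here keys off transcript length and a tools-likely flag; in practice you would drive that flag from your own intent classifier or the presence of tool-triggering phrases.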
Building a Voice Agent: Code Walkthrough
The Realtime API uses WebSockets. Here is a minimal but production-capable TypeScript implementation that handles session setup, tool calls, and audio streaming:
import WebSocket from 'ws'

const OPENAI_API_KEY = process.env.OPENAI_API_KEY!

async function createVoiceAgent() {
  // The v2 beta header is required for GPT-Realtime-2 features.
  const ws = new WebSocket('wss://api.openai.com/v1/realtime', {
    headers: {
      Authorization: `Bearer ${OPENAI_API_KEY}`,
      'OpenAI-Beta': 'realtime=v2',
    },
  })

  ws.on('open', () => {
    // Configure the session: model, reasoning effort, tools, and turn detection.
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        model: 'gpt-realtime-2',
        modalities: ['audio', 'text'],
        reasoning_effort: 'medium',
        instructions: 'You are a helpful assistant. Keep responses concise.',
        tools: [
          {
            type: 'function',
            name: 'get_weather',
            description: 'Get current weather for a location',
            parameters: {
              type: 'object',
              properties: {
                location: { type: 'string', description: 'City and country' },
              },
              required: ['location'],
            },
          },
        ],
        tool_choice: 'auto',
        turn_detection: {
          type: 'server_vad',
          threshold: 0.5,
          silence_duration_ms: 800,
        },
      },
    }))
  })

  ws.on('message', (raw: Buffer) => {
    const event = JSON.parse(raw.toString()) as Record<string, unknown>
    handleEvent(event, ws)
  })
}

function handleEvent(event: Record<string, unknown>, ws: WebSocket) {
  switch (event.type) {
    case 'response.audio.delta':
      // Base64 audio chunk: forward it to playback as it arrives.
      streamAudioChunk(event.delta as string)
      break
    case 'response.function_call_arguments.done':
      // The model finished emitting arguments for a tool call; execute it.
      executeTool(
        event.name as string,
        JSON.parse(event.arguments as string),
        event.call_id as string,
        ws
      )
      break
    case 'error':
      console.error('Realtime error:', event.error)
      break
  }
}

async function executeTool(
  name: string,
  args: Record<string, unknown>,
  callId: string,
  ws: WebSocket
) {
  let result: unknown = null
  if (name === 'get_weather') {
    result = await fetchWeather(args.location as string)
  }
  // Return the tool result to the conversation, then ask the model to respond.
  ws.send(JSON.stringify({
    type: 'conversation.item.create',
    item: { type: 'function_call_output', call_id: callId, output: JSON.stringify(result) },
  }))
  ws.send(JSON.stringify({ type: 'response.create' }))
}

// Playback and weather lookup are left to the host application.
declare function streamAudioChunk(delta: string): void
declare function fetchWeather(location: string): Promise<unknown>

createVoiceAgent()
Two details worth noting. First, the OpenAI-Beta: realtime=v2 header is required for GPT-Realtime-2 features; without it the endpoint falls back to v1 behavior and parallel tool calls will not work. Second, server_vad turn detection lets the API detect when the user has finished speaking without you managing silence thresholds client-side.
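Handling parallel tool calls on the client is mostly a buffering exercise. A sketch, assuming each parallel call still arrives as its own response.function_call_arguments.done event and the batch can be flushed once the model finishes emitting the current response; that grouping behavior is an assumption to confirm against the v2 event reference:

import WebSocket from 'ws'

interface PendingCall {
  name: string
  args: Record<string, unknown>
  callId: string
}

const pendingCalls: PendingCall[] = []

// Buffer each completed tool call instead of executing it immediately.
function onFunctionCallDone(event: Record<string, unknown>) {
  pendingCalls.push({
    name: event.name as string,
    args: JSON.parse(event.arguments as string),
    callId: event.call_id as string,
  })
}

// Flush the buffer when the current response completes (for example, on a
// response.done event): run every call concurrently, send all outputs back,
// then request the next response once.
async function flushPendingCalls(ws: WebSocket) {
  const results = await Promise.all(
    pendingCalls.map(async (call) => ({
      call,
      output: await runTool(call.name, call.args),
    }))
  )
  for (const { call, output } of results) {
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: { type: 'function_call_output', call_id: call.callId, output: JSON.stringify(output) },
    }))
  }
  pendingCalls.length = 0
  ws.send(JSON.stringify({ type: 'response.create' }))
}

// Tool dispatch is application-specific; wire it to your own implementations.
declare function runTool(name: string, args: Record<string, unknown>): Promise<unknown>

The difference from the sequential executeTool above is that the three calendar lookups from the earlier scheduling example run concurrently, and the model is asked to respond only once, after every output is in the conversation.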
GPT-Realtime-Translate: Live Speech-to-Speech Translation
GPT-Realtime-Translate handles a specific and previously under-served workflow: real-time spoken translation where the translated speech keeps pace with the original speaker rather than lagging behind in batch segments.
The model supports 70+ input languages and translates into 13 output languages. Output is synthesized audio in the target language with natural prosody — not a TTS read-out of a text translation. That distinction matters for user experience: a model that produces fluent spoken output in the target language feels qualitatively different from text passed through generic TTS.
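Configuring a translation session follows the same session.update pattern as GPT-Realtime-2, with a translation block in place of tools: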
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    model: 'gpt-realtime-translate',
    modalities: ['audio'],
    translation: {
      input_language: 'auto',
      output_language: 'es',
    },
    voice: 'alloy',
  },
}))
Setting input_language to auto enables automatic language detection per session. For deployments where the input language is known (a French-to-English support line, for example), specifying it explicitly reduces latency by skipping the detection step. The 13 output languages are English, Spanish, French, German, Japanese, Portuguese, Italian, Dutch, Korean, Polish, Russian, Chinese (Simplified), and Arabic.
Practical use cases that are now cost-viable with this model: multilingual customer support without per-language agent variants, international voice agent deployments from a single codebase, and live event interpretation at the edge without a dedicated interpreter pool for lower-volume language pairs.
GPT-Realtime-Whisper: Streaming Transcription
GPT-Realtime-Whisper addresses a gap that forced many teams into fragile hybrid architectures. Previous voice pipelines had to choose between batch transcription (accurate but delayed) and real-time streaming alternatives with lower accuracy and complex integration paths. GPT-Realtime-Whisper streams partial transcripts as the speaker talks, with accuracy comparable to Whisper Large V3 batch output.
This enables lower-latency hybrid workflows: display a live transcript in a UI while the voice agent is responding, run keyword-based routing decisions in real time, and log structured conversation data without waiting for each turn to complete. For customer support deployments, a live transcript makes it practical to surface agent-assist recommendations to a human supervisor watching the session — a pattern that significantly reduces escalation rates in enterprise deployments.
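A minimal sketch of a transcription-only session, assuming GPT-Realtime-Whisper reuses the Realtime WebSocket connection and session.update shape; the transcript event names below are assumptions to check against the v2 event reference:

import WebSocket from 'ws'

const ws = new WebSocket('wss://api.openai.com/v1/realtime', {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    'OpenAI-Beta': 'realtime=v2',
  },
})

ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      model: 'gpt-realtime-whisper',
      modalities: ['text'],   // transcript only, no synthesized audio back
      turn_detection: { type: 'server_vad', silence_duration_ms: 800 },
    },
  }))
})

ws.on('message', (raw) => {
  const event = JSON.parse(raw.toString())
  // Hypothetical event names for partial vs. finalized transcript text.
  if (event.type === 'transcript.delta') updateLiveTranscript(event.delta)
  if (event.type === 'transcript.done') logFinalTranscript(event.text)
})

// UI updates and logging are left to the host application.
declare function updateLiveTranscript(delta: string): void
declare function logFinalTranscript(text: string): void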
Pricing and Cost Modeling
Published API pricing for these models as of May 2026:
- GPT-Realtime-2: $40/hour for input audio, $80/hour for output audio at medium reasoning effort. low effort reduces input cost by approximately 40%.
- GPT-Realtime-Translate: $20/hour input, $40/hour for translated audio output.
- GPT-Realtime-Whisper: $6/hour for streaming transcription.
For a customer support deployment with an 8-minute average call duration using GPT-Realtime-2 at medium effort, per-call cost runs approximately $0.05–$0.11 depending on the input-to-output audio ratio. Routing simpler calls to low effort consistently brings this below $0.04. At 10,000 calls per month, the difference between static medium and effort-routed sessions is roughly $700/month on inference alone — before accounting for latency improvements that reduce average handle time.
Migrating from the Original Realtime API
If you are running gpt-4o-realtime-preview, migrating to GPT-Realtime-2 requires three targeted changes:
- Update the model ID to gpt-realtime-2 in your session configuration.
- Add the v2 header. Include OpenAI-Beta: realtime=v2 in your WebSocket connection headers. The v2 endpoint handles parallel tool calls and the new reasoning parameters; omitting it returns a 400 error for v2-only features.
- Review your context budget. If you built external context management to work around the 32K limit, you can simplify that logic, but do not remove it untested. Long-session workflows that never reset the conversation can still approach 128K on complex, tool-heavy calls.
The event schema is backwards compatible with the original Realtime API for core message types (response.audio.delta, conversation.item.create, session.update). New event types for parallel tool calls and reasoning progress are additive — existing event handlers will not break when you upgrade the model ID.
What to Build First
These three models together close the gaps that have kept voice AI in proof-of-concept status at most organizations. GPT-Realtime-2’s reliability on multi-step tool chains makes it viable for healthcare intake, financial services, and legal workflows where previous voice models had error rates too high to trust without mandatory human review. GPT-Realtime-Translate enables global deployments without per-language model engineering or prompt localization. GPT-Realtime-Whisper makes real-time supervision and logging practical without architectural workarounds.
The highest-ROI first deployment for most organizations is customer support deflection. A voice agent built on GPT-Realtime-2 that handles the predictable 40–60% of support volume — account lookups, status checks, scheduling, FAQ — will show measurable deflection rates within 30 days of production deployment. Start there, instrument it with a live transcript from GPT-Realtime-Whisper for quality monitoring, then expand to higher-complexity query categories as the baseline matures.
For teams building more complex agent architectures, parallel tool calling is the capability to prototype first. It removes the sequential bottleneck that made earlier voice agents feel slow on anything beyond simple queries and is the foundation of voice experiences that feel genuinely fast rather than impressive but sluggish. Pair it with the observability patterns from the AI agent monitoring guide before going to production — voice failures are harder to debug than text failures, and structured logging pays off disproportionately in voice pipelines.
Written by
Anup Karanjkar
Expert contributor at WOWHOW. Writing about AI, development, automation, and building products that ship.