xAI released Grok Voice Think Fast 1.0 on April 27, 2026: a dedicated voice agent model built for real-time, two-way conversations that can reason, use tools, and complete multi-step workflows without adding latency to audio responses. Together with the standalone Grok Speech-to-Text and Text-to-Speech APIs launched earlier in April, it gives xAI a complete voice stack that outperforms ElevenLabs, Deepgram, AssemblyAI, and OpenAI Realtime on key benchmark tasks. This guide covers the architecture, endpoints, pricing, quickstart code, and the specific scenarios where Grok Voice wins.
What Is Grok Voice Think Fast 1.0?
Grok Voice Think Fast 1.0 is xAI’s first dedicated voice agent model — distinct from the Grok 4.3 text model that has incidental voice capabilities. The key engineering difference is its background reasoning architecture: the model runs reasoning on a separate compute path from audio generation, which means it can think through complex, multi-step problems in real time without inserting pauses into the audio stream. Users hear no difference between a simple FAQ lookup and a complex multi-tool orchestration. Latency stays sub-second across both.
The model was built through collaboration between xAI and Starlink, whose customer support infrastructure processes millions of real-world voice interactions daily across languages and network conditions. That production environment shaped what matters in voice AI: accent robustness, structured data extraction (account numbers, addresses, phone numbers), multi-turn context retention, and graceful handling of overlapping speech.
On the Tau Voice Bench — the standard multi-step voice agent evaluation framework developed by Salesforce Research — Grok Voice Think Fast 1.0 achieves the top score as of April 2026, ahead of GPT-4o Realtime, Gemini 2.0 Flash Live, and ElevenLabs Conversational AI. The margin is largest on tasks requiring both understanding and sequential tool-driven action in a single conversation turn.
The Three-Layer Voice API Stack
xAI’s voice offering is organized into three independent layers. Understanding which layer you need determines your architecture and cost model. Each can be used on its own or composed together.
Layer 1: Speech-to-Text API
The Grok STT API supports batch and streaming transcription across 25 languages. It is the right component for use cases where you need transcription without a conversational response — meeting notes, call recording, voice command capture, accessibility tooling.
- Batch transcription: $0.10 per hour of audio
- Streaming transcription: $0.20 per hour of audio
- Supports 12 audio formats: MP3, WAV, FLAC, Opus, M4A, WebM, and more
- Speaker diarization (who said what) included at no extra charge
- Word-level timestamps for caption alignment and edit tooling
- Inverse Text Normalization (ITN): spoken “three hundred dollars” becomes “$300” in the transcript
- Multichannel audio support for phone calls recorded on two tracks
On phone call entity recognition benchmarks, Grok STT achieves a 5.0% word error rate on structured entities — account numbers, addresses, full names — compared to ElevenLabs at 12.0%, Deepgram at 13.5%, and AssemblyAI at 21.3%. For applications where errors in extracted data cause downstream failures (wrong order updates, incorrect shipping addresses, misrouted support tickets), this gap is production-relevant.
The batch endpoint accepts a multipart form upload; the streaming endpoint uses a WebSocket connection with audio chunks arriving as Base64-encoded PCM. Both endpoints return a JSON transcript with optional word-level timestamps and diarization segments.
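As a concrete sketch of the batch flow (the endpoint path, form field names, and the "grok-stt-1" model identifier below are assumptions modeled on common STT API conventions, not confirmed xAI syntax), an upload plus a small helper for rendering diarization segments might look like:

```javascript
import { readFile } from 'node:fs/promises'

// Batch transcription sketch. The endpoint path, field names, and
// model id are assumptions, not confirmed xAI syntax.
async function transcribe(path) {
  const form = new FormData()
  form.append('file', new Blob([await readFile(path)]), 'call.wav')
  form.append('model', 'grok-stt-1') // hypothetical model identifier
  const res = await fetch('https://api.x.ai/v1/audio/transcriptions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.XAI_API_KEY}` },
    body: form
  })
  return res.json() // JSON transcript with optional timestamps and diarization
}

// Render diarization segments as "Speaker N: text" lines for display.
function formatDiarization(segments) {
  return segments.map(s => `Speaker ${s.speaker}: ${s.text}`).join('\n')
}
```

The helper assumes each diarization segment carries a speaker label and its text, which is how most STT providers shape this output; verify the exact response schema against the xAI docs before relying on it.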
Layer 2: Text-to-Speech API
The Grok TTS API converts text to natural-sounding audio with fine-grained expressiveness controls. Pricing is $4.20 per million characters, which positions it near ElevenLabs Creator tier but with pure pay-as-you-go billing and no subscription requirement.
Five voices are available at launch spanning American English, British English, and a neutral AI voice profile. The most developer-friendly feature is inline speech tags. Rather than training a custom voice or navigating SSML, you embed tags directly in the input text string:
const response = await xai.audio.speech.create({
  model: "grok-tts-1",
  voice: "ember",
  input: "Let me check your account. [pause] Okay, I found it. Your balance is <emphasis>two hundred dollars</emphasis>."
})
Supported inline tags include [laugh], [sigh], [whisper], [gasp], and [clearing throat], along with wrapping tags like <emphasis>, <slow>, and <fast>, and a standalone <pause duration="500ms"> tag. For customer support and sales, these controls close the uncanny valley between robotic TTS and natural speech without requiring custom voice cloning or model fine-tuning.
Output formats include MP3 (high fidelity), WAV (uncompressed), and μ-law (telephony-optimized for PSTN/SIP pipelines at 8 kHz). The μ-law format matters for developers integrating with Twilio, Vonage, or any carrier SIP trunk where bandwidth and codec compatibility constrain format options.
The TTS API supports 20+ languages and handles multilingual text within a single request — the model switches pronunciation rules mid-string when it detects a language boundary, which is useful for international products where product names or legal disclaimers appear in a different language from the surrounding speech.
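Because billing is pure pay-as-you-go at $4.20 per million characters, spend can be estimated directly from input length. A minimal sketch (whether inline speech tags count toward billed characters is an assumption to verify against the billing docs):

```javascript
// Published TTS rate: $4.20 per 1,000,000 input characters.
const TTS_RATE_PER_CHAR = 4.20 / 1_000_000

// Estimate the cost of one TTS request from its input text.
// Assumption: billed length equals raw string length, including any
// inline speech tags; confirm against the billing documentation.
function ttsCostUSD(text) {
  return text.length * TTS_RATE_PER_CHAR
}
```

At this rate, a 500-character support reply costs about $0.0021, which makes per-request budgeting straightforward compared to seat-based subscription tiers.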
Layer 3: Voice Agent API (Grok Voice Think Fast 1.0)
The Voice Agent API combines STT, reasoning, tool use, and TTS into a single WebSocket session. This is the correct layer for any use case involving live back-and-forth conversation with real-time decision-making: customer support, sales qualification, appointment scheduling, technical troubleshooting, or any workflow where the AI needs to take action during the conversation.
The WebSocket endpoint is:
wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0
After the handshake, send a session configuration event to declare your tools, voice, and instructions:
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    voice: "atlas",
    instructions: "You are a customer support agent for Acme Corp. Use the order_lookup tool to find orders by ID.",
    tools: [
      {
        type: "function",
        name: "order_lookup",
        description: "Look up an order by order ID",
        parameters: {
          type: "object",
          properties: { order_id: { type: "string" } },
          required: ["order_id"]
        }
      }
    ]
  }
}))
Stream audio input as Base64-encoded PCM chunks. Receive audio output in real time as response.audio.delta events. Tool calls arrive as response.function_call events — your server executes the function and sends the result back as a conversation.item.create message. The model resumes audio generation from where it left off, with no silence in the audio stream while the tool call was in flight.
Pricing and Session Limits
Grok Voice Think Fast 1.0 is priced at $0.05 per minute of voice session time. The current tier includes 100 concurrent sessions per team and a 30-minute maximum session duration. Tool calls are billed separately when the model invokes built-in xAI tools (web search, X search, MCP-backed integrations) at $0.002 per call. Custom server-side functions — where your code runs the function and returns the result — are not billed beyond the base minute rate.
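These rates make per-call budgeting simple to model. A sketch using only the published numbers above:

```javascript
// Voice session cost: $0.05 per minute, plus $0.002 per built-in tool
// call (web search, X search, MCP-backed integrations). Custom
// server-side functions are not billed beyond the minute rate.
function sessionCostUSD(minutes, builtInToolCalls = 0) {
  return minutes * 0.05 + builtInToolCalls * 0.002
}
```

A 12-minute support call that makes three web searches costs roughly $0.61; a call that only invokes your own server-side functions costs the base $0.05 per minute.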
Compared to the OpenAI Realtime API, which charges $0.06 per minute for input audio on gpt-4o-realtime-preview, Grok Voice is roughly 17% cheaper per minute at the base rate. For teams running high-volume voice workloads, xAI offers enterprise pricing through direct negotiation in the x.ai/api console, which removes the concurrent session cap and extends session durations.
The 30-minute session cap is the most significant architectural constraint for long-duration workflows. Phone support calls rarely exceed this threshold, but complex agentic workflows — multi-stage insurance intake, extended technical onboarding, tax filing walkthroughs — may need session resumption logic. xAI recommends generating a context summary at 25 minutes and opening a new session with that summary as the system prompt preamble.
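A minimal sketch of that recommendation, assuming your app tracks when each session started (how you obtain the recap is up to your implementation, e.g. by asking the model to summarize before handoff):

```javascript
// Rotate before the 30-minute cap: trigger a recap at the 25-minute
// mark xAI recommends, then seed the next session's instructions.
const SUMMARY_AT_MS = 25 * 60 * 1000

// True once the session has been live for 25 minutes or more.
function shouldSummarize(startedAtMs, nowMs) {
  return nowMs - startedAtMs >= SUMMARY_AT_MS
}

// Prepend the previous session's summary to the system instructions
// used when opening the replacement session.
function resumptionInstructions(baseInstructions, summary) {
  return baseInstructions + '\n\nContext from the previous session:\n' + summary
}
```

The replacement session is an ordinary new WebSocket connection whose session.update carries the augmented instructions, so no server-side state transfer is required beyond the summary text itself.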
Built-In Tool Use: X Search and Web Search
One feature that distinguishes Grok Voice Agent from OpenAI Realtime is native access to xAI’s X search and web search inside live voice sessions. You enable these with a single tool declaration — no custom retrieval pipeline required:
session: {
  tools: [
    { type: "web_search" },
    { type: "x_search" }
  ]
}
This means your voice agent can answer questions about current events, real-time prices, breaking news, or any topic that changes faster than a training cutoff, without you building and hosting a retrieval backend. For product categories like market intelligence, live sports results, weather-aware scheduling, or news briefings, this dramatically reduces the engineering surface needed to ship a production voice product.
MCP Integration for Enterprise Data Sources
Grok Voice Agent supports Model Context Protocol (MCP)-backed tools in voice sessions, which means you can connect a voice agent to any MCP server — internal databases, CRMs, ticketing systems, e-commerce APIs — using the same tool configuration syntax as the HTTP API.
The voice agent calls your MCP server in real time during a live session. Because the background reasoning architecture separates audio generation from tool execution, the model continues its reasoning about what to do next while the MCP round-trip is completing. From the caller’s perspective, there is no pause. MCP tool calls inside voice sessions are billed at $0.002 per call, the same rate as HTTP API tool calls, making budget modeling predictable for enterprise integrations.
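A hypothetical session configuration wiring in an MCP server might look like the following; the field names server_url and allowed_tools are assumptions modeled on the tool declaration pattern shown earlier, not confirmed xAI syntax:

```javascript
// Hypothetical MCP-backed tool declaration for a voice session.
// Field names below are assumptions, not confirmed xAI syntax.
const session = {
  voice: 'atlas',
  instructions: 'You are an internal support agent. Use CRM tools to look up customers.',
  tools: [
    {
      type: 'mcp',
      server_url: 'https://mcp.example.internal',     // your MCP server
      allowed_tools: ['crm_lookup', 'ticket_create']  // hypothetical tool names
    }
  ]
}
```

Restricting the agent to an explicit allow-list of MCP tools is a sensible default in voice contexts, where a misfired tool call is harder for the caller to notice and correct than in a text UI.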
Use Cases Where Grok Voice Think Fast 1.0 Has a Genuine Advantage
Based on the architecture and benchmarks, three categories of use case see the largest gains over competing voice stacks:
Customer Support Automation: The 5.0% entity extraction error rate (vs. 12–21% for competitors) makes Grok STT the right choice for support workflows where errors in extracted data have direct downstream consequences. Combine with the Voice Agent API for end-to-end automation of support queues that currently require human agents.
Multi-Turn Sales and Scheduling: Background reasoning enables the model to handle complex objections, multi-step qualification sequences, and scheduling logic — including tool calls to calendar APIs and CRM lookups — without unnatural pauses. The Starlink collaboration suggests the model was specifically tuned for sales workflows where users are often distracted, speak in fragments, or change direction mid-conversation.
Technical Troubleshooting with Live Documentation Lookup: The combination of background reasoning, tool use, and built-in web search makes this the strongest current option for technical support scenarios. The model can diagnose an issue, look up current documentation, and walk a user through steps simultaneously — all in a live voice session without the caller hearing the model “going to check.”
Quickstart: Minimal Node.js Voice Agent
The following Node.js example sets up a voice session that handles a balance inquiry using a custom function tool. It uses the ws npm package for the WebSocket connection:
import WebSocket from 'ws'

const ws = new WebSocket(
  'wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0',
  { headers: { Authorization: `Bearer ${process.env.XAI_API_KEY}` } }
)

ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      voice: 'ember',
      instructions: 'You are a banking assistant. Help with balance inquiries.',
      tools: [{
        type: 'function',
        name: 'get_balance',
        description: 'Return account balance',
        parameters: {
          type: 'object',
          properties: { account_number: { type: 'string' } },
          required: ['account_number']
        }
      }]
    }
  }))
})

ws.on('message', (data) => {
  const event = JSON.parse(data.toString())

  if (event.type === 'response.function_call' && event.name === 'get_balance') {
    const { account_number } = JSON.parse(event.arguments)
    console.log('Looking up account:', account_number)
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'function_call_output',
        call_id: event.call_id,
        output: JSON.stringify({ balance: '$1,240.55', currency: 'USD' })
      }
    }))
  }

  if (event.type === 'response.audio.delta') {
    // Pipe event.delta (Base64 PCM) to your audio output stream
  }
})
Full documentation covering session resumption, multi-party conferencing (up to four participants per session, in beta), and Twilio/SIP integration patterns is available in the xAI developer docs under the Voice API section.
What the Complete Voice Stack Means for Developers
Six months ago, building a production voice agent required stitching together three vendors: a transcription API, a language model API, and a TTS provider. Each introduced latency, a billing relationship, and potential points of failure. xAI now offers a single-vendor solution that outperforms the best multi-vendor stacks on the primary benchmark.
The practical implication is architectural simplification. Fewer API round-trips, a single auth token, and a unified session model replace the coordination overhead of multi-vendor pipelines. For founders building in voice AI — customer support, sales automation, healthcare intake, accessibility tools, language learning — the barrier to a production-quality MVP has dropped significantly this week.
The remaining gap to watch is language coverage: 25 languages in the STT API versus ElevenLabs’ 32 and Google Cloud Speech’s 125. For multilingual enterprise use cases, a hybrid approach — Grok Voice for English and high-priority languages, a specialist provider for the long tail — may still make sense until xAI expands coverage. xAI has indicated language additions will arrive monthly through Q3 2026, so the gap should close by mid-year.
Track the release notes at docs.x.ai/developers/release-notes for language additions, multi-party conferencing GA, and the enterprise tier pricing announcement that xAI has flagged as coming in Q2 2026.
Written by
Anup Karanjkar
Expert contributor at WOWHOW. Writing about AI, development, automation, and building products that ship.