On April 17, 2026, xAI dropped Grok 4.3 Beta for SuperGrok Heavy subscribers — and it landed with three capabilities that change how developers should evaluate the platform. The update adds native document generation (PDFs, spreadsheets, and PowerPoint decks directly from conversation), conversational video understanding, and a pair of standalone voice APIs — Speech-to-Text and Text-to-Speech — with pricing that undercuts OpenAI by 86% and ElevenLabs by 92%. If you have been watching xAI from the sidelines, this release is worth a closer look. If you are actively evaluating voice API providers or building document-generation workflows, it may shift your shortlist entirely. Here is a complete developer breakdown of what changed, what the APIs look like, and where the real gaps remain.
What Is Grok 4.3 Beta and Who Has Access
Grok 4.3 Beta is xAI’s latest model update, currently gated behind the SuperGrok Heavy subscription tier at $300 per month — the most expensive consumer AI plan currently on the market. Full rollout to lower tiers is expected in mid-to-late May 2026.
The $300 price point bundles higher rate limits, priority compute access, and the document and voice features described below. For developers who want access to the standalone voice APIs without a SuperGrok subscription, the STT and TTS APIs are available directly through the xAI API console and billed per-use independently of any subscription tier.
One important caveat: xAI has shared minimal official release documentation for this update. The feature descriptions and benchmarks below are sourced from the official xAI developer docs, the xAI news announcement pages, and secondary technical coverage. As with most early beta releases, independent production verification is advisable before committing to a migration at scale.
Document Generation: PDFs, Spreadsheets, and Slide Decks From Conversation
The most immediately practical feature for business users is Grok 4.3’s ability to generate downloadable, formatted documents directly inside a conversation. This is not a “here is the text, paste it yourself” workflow — the model generates complete output files you can share or submit without additional editing.
Three output types are supported at launch:
- PDF reports: Formatted business reports, research summaries, and client-facing documents with headers, sections, and basic layout. Useful for competitive analysis outputs, meeting briefings, and structured summaries that need to look professional out of the box.
- Excel-compatible spreadsheets: Fully populated tables with data, formulas, and structure ready to open in Excel or Google Sheets. This is particularly useful for financial models, comparison tables, and structured data outputs where manually reformatting an LLM response into a spreadsheet has historically cost significant time.
- PowerPoint-compatible presentations: Slide decks with title slides, content slides, and speaker notes, ready to open in PowerPoint or Google Slides. Building a slide deck from an AI conversation has previously required either a dedicated tool like Gamma or manual reformatting; Grok 4.3 handles this natively within the conversation itself.
To use document generation, describe the document you need and explicitly request a downloadable file in your prompt. For example: “Write a competitive analysis covering three SaaS pricing page strategies. Output as a formatted PDF with an executive summary and one section per competitor.” Grok 4.3 will generate and attach the file directly in the conversation thread.
This puts Grok 4.3 in direct competition with Gamma, Notion AI, and Microsoft Copilot’s document generation features — but embedded in the same context as a frontier LLM conversation rather than requiring a separate tool or context switch.
Native Video Understanding
Grok 4.3 adds native video input, bringing xAI into a space that Google Gemini 2.5 Pro and GPT-4o already occupy. You can upload a video clip and interact with it conversationally: the model understands visual content, motion, and temporal context across frames.
Practical use cases include:
- Meeting recordings: Upload a Zoom or Meet recording and ask for a structured summary with action items, decisions made, and follow-up owners.
- Product demo analysis: Feed in a competitor product video and ask for a feature breakdown, UX observations, or gap analysis against your own product.
- Tutorial compression: Summarize a lengthy technical walkthrough into numbered step-by-step instructions, timestamped for reference.
- Content extraction: Pull text visible in video frames — slides, screen recordings, whiteboard sessions — without needing a separate OCR or transcription pipeline.
Supported formats include MP4, MOV, and WebM. Video input is available in both the Grok chat interface and through the API for developers building media-processing pipelines.
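For developers, a video-analysis call would presumably go through xAI's OpenAI-compatible chat completions endpoint. The sketch below shows one plausible request shape; the `video_url` content-part type, field names, and the `grok-4.3` model identifier are all assumptions, not confirmed API details, so verify them against the xAI docs before use.

```python
import json

# Sketch of a video-analysis request payload. The endpoint is xAI's
# chat completions URL; the "video_url" content part and model name
# are assumptions, not confirmed schema.
API_URL = "https://api.x.ai/v1/chat/completions"

def build_video_request(video_url: str, question: str) -> dict:
    """Assemble a chat request that pairs a video with a question."""
    return {
        "model": "grok-4.3",  # hypothetical model id for the beta
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }

payload = build_video_request(
    "https://example.com/standup.mp4",
    "Summarize this meeting with action items, decisions, and owners.",
)
print(json.dumps(payload, indent=2))
```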
Grok Speech-to-Text API: Developer Guide
The most technically significant addition in Grok 4.3 is the standalone Grok STT API. xAI describes it as “the same stack that powers Grok Voice, Tesla vehicle infotainment, and Starlink customer support” — a lineage that suggests production-grade reliability has already been stress-tested at significant scale.
Key Features
- 25+ language support: Full multilingual transcription with automatic language detection. No per-language model switching required.
- 12 audio formats: MP3, WAV, FLAC, OGG, M4A, WEBM, and more — covering the full range of consumer and enterprise audio sources.
- Word-level timestamps: Every transcribed word includes precise start and end timestamps. Essential for caption generation, searchable video, synchronization pipelines, and audio alignment in media production workflows.
- Speaker diarization: Distinguishes between multiple speakers in a recording and tags each word with a speaker identifier. Critical for meeting transcription, interview processing, and call center analytics.
- Multichannel audio support: Processes stereo and multi-channel audio with per-channel transcription options, useful for call recording systems that separate caller and agent onto different channels.
- Inverse Text Normalization (ITN): Converts spoken-form numbers and abbreviations into standard written form. “Twenty-two thousand dollars” becomes “$22,000” in the output. Eliminates a post-processing step that every other STT pipeline requires separately.
API Endpoints and Quickstart
Two modes are available. Batch transcription uses a standard REST endpoint and returns a full transcript with timestamps and speaker labels once processing completes. Realtime streaming uses a WebSocket connection and pushes partial transcripts as audio arrives — the right choice for live captioning, real-time meeting notes, and voice agent interfaces where latency matters.
Batch (REST): POST https://api.x.ai/v1/stt
Realtime (WSS): wss://api.x.ai/v1/stt
Authentication uses the standard xAI API key in the Authorization: Bearer header, consistent with the rest of the xAI API surface.
Pricing and Benchmark Comparison
Pricing is straightforward: $0.10 per hour for batch processing and $0.20 per hour for streaming. There are no per-request fees, per-language surcharges, or feature tier add-ons.
On phone call entity recognition error rate — a standard STT quality benchmark for business audio — xAI reports the following comparison from their internal testing:
- Grok STT: 5.0%
- ElevenLabs: 12.0%
- Deepgram: 13.5%
- AssemblyAI: 21.3%
These are xAI’s own figures, so treat them as directional rather than independent benchmarks. That said, even applying significant skepticism, the pricing advantage is not self-reported: at $0.10 per hour batch, Grok STT is competitive with or cheaper than all three named competitors on list pricing. For a workflow processing 1,000 hours of audio per month, the cost difference between Grok and Deepgram can run to hundreds of dollars monthly.
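The arithmetic behind that claim is straightforward. At the published Grok rates, a 1,000-hour monthly workload works out as follows (competitor list prices vary and are not reproduced here; check current pricing pages before comparing):

```python
# Back-of-envelope monthly STT cost at the published Grok rates:
# $0.10/hour batch, $0.20/hour streaming.
GROK_BATCH_PER_HOUR = 0.10
GROK_STREAM_PER_HOUR = 0.20

def monthly_cost(hours: float, rate_per_hour: float) -> float:
    """Monthly spend for a given audio volume and per-hour rate."""
    return round(hours * rate_per_hour, 2)

hours = 1_000  # the workload from the example above
print(monthly_cost(hours, GROK_BATCH_PER_HOUR))   # 100.0
print(monthly_cost(hours, GROK_STREAM_PER_HOUR))  # 200.0
```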
Grok Text-to-Speech API: Developer Guide
The Grok TTS API converts text to speech at pricing that significantly undercuts the current market leaders. At $4.20 per million characters, it costs approximately 86% less than OpenAI TTS (~$30/1M) and 92% less than ElevenLabs (~$50/1M).
For a voice application generating 10 million characters per month, that maps to approximately $42/month on Grok versus $300/month on OpenAI or $500/month on ElevenLabs. The economics are difficult to ignore for high-volume voice applications.
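Those figures can be reproduced directly from the per-million-character rates quoted above:

```python
# Reproducing the cost comparison above: per-million-character list
# rates from the article, applied to a 10M-character monthly workload.
RATES_PER_MILLION = {
    "grok": 4.20,         # $4.20 / 1M chars
    "openai": 30.00,      # ~$30 / 1M chars
    "elevenlabs": 50.00,  # ~$50 / 1M chars
}

def monthly_tts_cost(chars: int, provider: str) -> float:
    """Monthly spend for a character volume at a provider's rate."""
    return round(chars / 1_000_000 * RATES_PER_MILLION[provider], 2)

chars = 10_000_000
for provider in RATES_PER_MILLION:
    print(provider, monthly_tts_cost(chars, provider))
# grok 42.0, openai 300.0, elevenlabs 500.0
```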
Voices and Language Support
Five voices are available at launch, designed to cover a range of professional contexts:
- Ara — Female, neutral professional tone. Strong default for enterprise-facing applications.
- Eve — Female, warmer and more expressive. Well-suited for consumer-facing interfaces and conversational agents.
- Leo — Male, measured and authoritative. Appropriate for news reading, legal documents, and formal narration.
- Rex — Male, energetic. Better for promotional content, tutorials, and dynamic narration.
- Sal — Gender-neutral, optimized for accessibility contexts and screen reader applications.
All five voices support 20+ languages with automatic language detection based on input text. No per-language voice selection is required for multilingual applications.
Expressive Speech Tags
One of the more developer-friendly features in Grok TTS is inline speech tag support — markers embedded directly in the input text that control vocal delivery without requiring a separately fine-tuned model or manual audio post-processing:
- [laugh] — Inserts a natural laugh at that position in the audio
- [sigh] — Adds an audible sigh
- [whisper]text[/whisper] — Renders the enclosed text in a whispered delivery
- [pause] — Inserts a natural pause in the audio flow
This gives developers fine-grained control over emotional tone and pacing that would otherwise require either manual audio production or a separate voice acting pipeline. For conversational AI agents, character voices in interactive media, and podcast-style content generation, the practical value is significant.
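Because the tags are plain inline markers, assembling tagged input is just string construction. The helpers below are a convenience sketch (the functions are hypothetical; only the tag syntax comes from the feature description above):

```python
# Tiny helpers illustrating the inline tag syntax described above.
# The tags ([pause], [whisper]...[/whisper]) come from the article;
# the helper functions themselves are illustrative conveniences.
def whisper(text: str) -> str:
    """Wrap text in whisper-delivery tags."""
    return f"[whisper]{text}[/whisper]"

def with_pause(*segments: str) -> str:
    """Join segments with a natural pause between them."""
    return " [pause] ".join(segments)

line = with_pause(
    "Welcome back to the show.",
    whisper("This part is just between us."),
)
print(line)
# Welcome back to the show. [pause] [whisper]This part is just between us.[/whisper]
```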
API Endpoint
POST https://api.x.ai/v1/tts
The request body takes the input text string, the target voice name, and the desired output format (MP3, WAV, or OPUS). The response is the audio file binary, ready to stream or store.
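A request sketch under those constraints might look like this. The endpoint and the three body fields (text, voice, format) follow the description above, but the exact field names (`input`, `voice`, `format`) are assumptions modeled on common TTS API shapes; confirm them against the xAI API reference.

```python
# Sketch of a TTS request against the documented endpoint
# (POST https://api.x.ai/v1/tts). Body field names are assumed.
TTS_URL = "https://api.x.ai/v1/tts"

def build_tts_body(text: str, voice: str = "Ara", fmt: str = "mp3") -> dict:
    """Assemble the request body: input text, voice name, output format."""
    return {"input": text, "voice": voice, "format": fmt}

def synthesize(text: str, api_key: str, out_path: str = "speech.mp3") -> str:
    """Send the request and write the returned audio binary (network call)."""
    import requests  # third-party; pip install requests
    resp = requests.post(
        TTS_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_tts_body(text),
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)
    return out_path

body = build_tts_body("Hello from Grok.", voice="Eve")
print(body)
```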
How Grok 4.3 Compares to GPT-4o and Claude
Against the two dominant frontier models, Grok 4.3 has genuine strengths in some areas and real gaps in others.
Document generation: GPT-4o with Code Interpreter can produce downloadable files but requires users to carefully construct file-format instructions. Claude 3.7 Sonnet has no native downloadable file generation in its chat interface as of April 2026. Grok 4.3’s implementation is cleaner for business document workflows that need to output a finished file rather than formatted text.
Video understanding: GPT-4o and Gemini 2.5 Pro handle video natively. Claude currently does not accept video input. Grok 4.3 enters an already-competitive space here rather than a greenfield one.
Voice APIs: Neither OpenAI's Whisper and TTS stack nor ElevenLabs matches Grok's pricing. Deepgram and AssemblyAI are the strongest feature competitors for STT, but lose on price. For high-volume voice applications, Grok enters the market as the cheapest credible option with competitive accuracy claims.
Persistent memory: Grok 4.3 still lacks persistent memory between sessions — a capability ChatGPT added two years ago and Claude approximates with Projects. At $300 per month, this is the hardest omission to defend. For any use case where users expect the assistant to remember prior conversations, this is a real limitation that no amount of pricing advantage on voice APIs compensates for.
Who Should Use Grok 4.3 Beta Right Now
Given the $300/month gate and the beta status, the practical audience for Grok 4.3 today is narrower than xAI likely intends:
- Voice application developers evaluating STT/TTS providers for production pipelines. The standalone APIs are accessible without a SuperGrok subscription and the pricing difference justifies a benchmark run against your current provider.
- Agencies and consultants building document-generation workflows for clients where the ability to output a formatted PDF or slide deck from a conversation has direct workflow value.
- Teams processing meeting recordings who need speaker diarization, transcription, and summarization in one integrated system.
- Enterprises already using xAI APIs who want to consolidate voice infrastructure onto a single provider relationship.
For individual developers and smaller teams, the $300/month bar is prohibitive for testing. The practical entry point is the standalone voice APIs, which are separately priced and offer the highest ROI relative to current alternatives.
What to Watch For in the Full Rollout
The full Grok 4.3 rollout to lower-tier subscribers is expected in mid-to-late May 2026. Several things are worth watching as that happens: independent STT benchmark validation against the xAI-reported figures, whether persistent memory gets added before or alongside the wider rollout, and how document generation holds up for complex multi-section formats at production scale.
For the voice APIs specifically, the most useful data point will be independent accuracy testing on domain-specific vocabulary — medical, legal, and technical terminology are where STT systems diverge most from headline benchmark figures, and those are exactly the use cases that drive enterprise voice API adoption.
Grok 4.3 Beta is the most consequential xAI model update since Grok 4 launched. The voice API pricing alone restructures the economics of voice AI at scale, and document generation closes a practical gap that competing platforms haven’t addressed as cleanly. The $300/month gate is a real barrier for most developers right now — but the standalone voice APIs are available today and worth benchmarking against your current stack before the full rollout changes the conversation entirely.
Written by
Anup Karanjkar
Expert contributor at WOWHOW. Writing about AI, development, automation, and building products that ship.