Native Video Understanding
Grok 4.3 adds native video input, bringing xAI into a space that Google Gemini 2.5 Pro and GPT-4o already occupy. You can upload a video clip and interact with it conversationally: the model understands visual content, motion, and temporal context across frames.
Practical use cases include:
- Meeting recordings: Upload a Zoom or Meet recording and ask for a structured summary with action items, decisions made, and follow-up owners.
- Product demo analysis: Feed in a competitor product video and ask for a feature breakdown, UX observations, or gap analysis against your own product.
- Tutorial compression: Summarize a lengthy technical walkthrough into numbered step-by-step instructions, timestamped for reference.
- Content extraction: Pull text visible in video frames — slides, screen recordings, whiteboard sessions — without needing a separate OCR or transcription pipeline.
Supported formats include MP4, MOV, and WebM. Video input is available in both the Grok chat interface and through the API for developers building media-processing pipelines.
Grok Speech-to-Text API: Developer Guide
The most technically significant addition in Grok 4.3 is the standalone Grok STT API. xAI describes it as “the same stack that powers Grok Voice, Tesla vehicle infotainment, and Starlink customer support,” a lineage that suggests the stack has already been stress-tested in production at significant scale.
Key Features
- 25+ language support: Full multilingual transcription with automatic language detection. No per-language model switching required.
- 12 audio formats: MP3, WAV, FLAC, OGG, M4A, WEBM, and more — covering the full range of consumer and enterprise audio sources.
- Word-level timestamps: Every transcribed word includes precise start and end timestamps. Essential for caption generation, searchable video, synchronization pipelines, and audio alignment in media production workflows.
- Speaker diarization: Distinguishes between multiple speakers in a recording and tags each word with a speaker identifier. Critical for meeting transcription, interview processing, and call center analytics.
- Multichannel audio support: Processes stereo and multi-channel audio with per-channel transcription options, useful for call recording systems that separate caller and agent onto different channels.
- Inverse Text Normalization (ITN): Converts spoken-form numbers and abbreviations into standard written form. “Twenty-two thousand dollars” becomes “$22,000” in the output. This removes a post-processing step that many STT pipelines otherwise have to handle separately.
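To make the word-level timestamps and speaker tags concrete, here is a minimal sketch of collapsing per-word results into speaker turns. The response shape (a list of dicts with `word`, `start`, `end`, and `speaker` keys) is an assumption made for illustration; the actual field names in the Grok STT response may differ.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def group_into_turns(words):
    """Merge consecutive words from the same speaker into one turn.

    `words` is a list of dicts with hypothetical keys:
    "word", "start", "end" (seconds), and "speaker".
    """
    turns = []
    for w in words:
        if turns and turns[-1].speaker == w["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            last = turns[-1]
            last.end = w["end"]
            last.text += " " + w["word"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(Turn(w["speaker"], w["start"], w["end"], w["word"]))
    return turns


# Hypothetical sample, shaped like a diarized word-level transcript.
sample_words = [
    {"word": "Budget", "start": 0.0, "end": 0.4, "speaker": "S1"},
    {"word": "is", "start": 0.4, "end": 0.5, "speaker": "S1"},
    {"word": "$22,000", "start": 0.5, "end": 1.3, "speaker": "S1"},
    {"word": "Approved", "start": 1.6, "end": 2.1, "speaker": "S2"},
]

for turn in group_into_turns(sample_words):
    print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```

The same turn structure is a convenient starting point for caption files or meeting summaries, since each turn carries its own start and end time.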
API Endpoints and Quickstart
Two modes are available. Batch transcription uses a standard REST endpoint and returns a full transcript with timestamps and speaker labels once processing completes. Realtime streaming uses a WebSocket connection and pushes partial transcripts as audio arrives — the right choice for live captioning, real-time meeting notes, and voice agent interfaces where latency matters.
Batch (REST): POST https://api.x.ai/v1/stt
Realtime (WSS): wss://api.x.ai/v1/stt
Authentication uses the standard xAI API key in the Authorization: Bearer header, consistent with the rest of the xAI API surface.
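As a rough sketch of what a batch call could look like, the snippet below builds (but does not send) a request to the batch endpoint with the Bearer header. The endpoint and header come from the text above; posting the audio as a raw request body with a `Content-Type` header is an assumption, and the real API may expect multipart form data or additional parameters.

```python
import urllib.request

STT_BATCH_URL = "https://api.x.ai/v1/stt"  # batch endpoint as listed above


def build_stt_request(api_key, audio_bytes, content_type="audio/wav"):
    """Build (but do not send) a batch transcription request.

    ASSUMPTION: the audio is posted as the raw request body with its
    MIME type in Content-Type. The real API may expect multipart form
    data or extra parameters instead; check the xAI documentation.
    """
    return urllib.request.Request(
        STT_BATCH_URL,
        data=audio_bytes,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": content_type,
        },
    )


# Placeholder key and dummy bytes, for illustration only.
req = build_stt_request("YOUR_XAI_API_KEY", b"\x00\x01", "audio/wav")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it. Keeping construction separate from sending makes the headers easy to inspect and test without network access.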
Pricing and Benchmark Comparison
Pricing is straightforward: $0.10 per hour for batch processing and $0.20 per hour for streaming. There are no per-request fees, per-language surcharges, or feature tier add-ons.
On phone call entity recognition error rate (a standard STT quality benchmark for business audio), xAI reports the following comparison from its internal testing:
- Grok STT: 5.0%
- ElevenLabs: 12.0%
- Deepgram: 13.5%
- AssemblyAI: 21.3%
These are xAI’s own figures, so treat them as directional rather than independent benchmarks. That said, even applying significant skepticism, the pricing is a published list price rather than a self-reported result: at $0.10 per hour batch, Grok STT is competitive with or cheaper than all three named competitors on list pricing. For a workflow processing 1,000 hours of audio per month, that is $100 for batch transcription (or $200 for streaming), and the difference against Deepgram's list pricing can run to hundreds of dollars monthly.
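The per-hour arithmetic is simple enough to check directly. This is a small sketch using the two Grok STT rates quoted above; the function name is illustrative, and you would substitute your current provider's rate yourself to get a comparison, since no competitor prices are assumed here.

```python
from decimal import Decimal

# Per-hour list prices quoted above, in USD.
BATCH_RATE = Decimal("0.10")
STREAMING_RATE = Decimal("0.20")


def monthly_stt_cost(hours, rate):
    """Return the monthly bill in USD for `hours` of audio at `rate` per hour."""
    return Decimal(hours) * rate


hours = 1000
print(f"Batch:     ${monthly_stt_cost(hours, BATCH_RATE):.2f}")
print(f"Streaming: ${monthly_stt_cost(hours, STREAMING_RATE):.2f}")
```

`Decimal` is used instead of floats so that currency values stay exact when the totals are compared or summed.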
Grok Text-to-Speech API: Developer Guide
The Grok TTS API converts text to speech at pricing that significantly undercuts the current market leaders. At $4.20 per million characters, it costs approximately 86% less than OpenAI TTS (~$30/1M) and 92% less than ElevenLabs (~$50/1M).
For a voice application generating 10 million characters per month, that maps to approximately $42/month on Grok versus $300/month on OpenAI or $500/month on ElevenLabs. The economics are difficult to ignore for high-volume voice applications.
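The percentages and monthly totals above can be reproduced with a few lines. This sketch uses only the per-million-character prices quoted in this article; the OpenAI and ElevenLabs figures are the article's approximations, not verified list prices.

```python
from decimal import Decimal

# USD per million characters; the OpenAI and ElevenLabs figures are the
# approximate prices quoted in this article.
PRICE_PER_MILLION = {
    "Grok": Decimal("4.20"),
    "OpenAI": Decimal("30"),
    "ElevenLabs": Decimal("50"),
}


def monthly_cost(provider, characters):
    """Monthly bill in USD for `characters` of synthesized text."""
    return PRICE_PER_MILLION[provider] * Decimal(characters) / Decimal(1_000_000)


def savings_percent(provider):
    """Percent saved by using Grok instead of `provider`."""
    grok = PRICE_PER_MILLION["Grok"]
    return (1 - grok / PRICE_PER_MILLION[provider]) * 100


for name in PRICE_PER_MILLION:
    print(f"{name}: ${monthly_cost(name, 10_000_000):.2f}/month")
for name in ("OpenAI", "ElevenLabs"):
    print(f"Grok vs {name}: about {round(savings_percent(name))}% less")
```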
Voices and Language Support
Five voices are available at launch, designed to cover a range of professional contexts:
- Ara — Female, neutral professional tone. Strong default for enterprise-facing applications.
- Eve — Female, warmer and more expressive. Well-suited for consumer-facing interfaces and conversational agents.
- Leo — Male, measured and authoritative. Appropriate for news reading, legal documents, and formal narration.
- Rex — Male, energetic. Better for promotional content, tutorials, and dynamic narration.
- Sal — Gender-neutral, optimized for accessibility contexts and screen reader applications.
All five voices support 20+ languages with automatic language detection based on input text. No per-language voice selection is required for multilingual applications.
Expressive Speech Tags
One of the more developer-friendly features in Grok TTS is inline speech tag support — markers embedded directly in the input text that control vocal delivery without requiring a separately fine-tuned model or manual audio post-processing:
- [laugh]: Inserts a natural laugh at that position in the audio
- [sigh]: Adds an audible sigh
- [whisper]text[/whisper]: Renders the enclosed text in a whispered delivery
- [pause]: Inserts a natural pause in the audio flow
This gives developers fine-grained control over emotional tone and pacing that would otherwise require either manual audio production or a separate voice acting pipeline. For conversational AI agents, character voices in interactive media, and podcast-style content generation, the practical value is significant.
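Because the tags live inside the input string, a typo such as a missing [/whisper] is easy to ship by accident. The sketch below is a hypothetical client-side lint for the four tags listed above; `whisper` and `check_tags` are illustrative helpers, not part of any xAI SDK, and treating nested [whisper] as an error is an assumption about how the API behaves.

```python
import re

# Tags described above. [whisper] wraps text; the others stand alone.
STANDALONE_TAGS = {"laugh", "sigh", "pause"}
TAG_PATTERN = re.compile(r"\[(/?)([a-z]+)\]")


def whisper(text):
    """Wrap text in a whisper tag pair."""
    return f"[whisper]{text}[/whisper]"


def check_tags(text):
    """Return a list of problems found in the tagged text (empty if none)."""
    problems = []
    whisper_open = False
    for closing, name in TAG_PATTERN.findall(text):
        if name == "whisper":
            if closing and not whisper_open:
                problems.append("[/whisper] without [whisper]")
            elif not closing and whisper_open:
                # ASSUMPTION: nesting is treated as an error here.
                problems.append("nested [whisper]")
            whisper_open = not closing
        elif closing or name not in STANDALONE_TAGS:
            problems.append(f"unknown tag: [{closing}{name}]")
    if whisper_open:
        problems.append("unclosed [whisper]")
    return problems


script = "Well [pause] that went better than expected. [laugh] " + whisper(
    "Don't tell anyone."
)
print(script)
print(check_tags(script))
```

Running a check like this before sending text to the API keeps malformed tags from being read aloud literally or silently dropped, whichever the service does with them.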
API Endpoint
POST https://api.x.ai/v1/tts
The request body takes the input text string, the target voice name, and the desired output format (MP3, WAV, or OPUS). The response is the audio file binary, ready to stream or store.
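A minimal sketch of assembling such a request follows. The endpoint, voice names, and output formats come from the text above; the JSON keys `text`, `voice`, and `format`, and the lowercase format strings, are placeholders chosen for illustration, since the actual field names are not given here.

```python
import json
import urllib.request

TTS_URL = "https://api.x.ai/v1/tts"  # endpoint as listed above
VOICES = {"Ara", "Eve", "Leo", "Rex", "Sal"}
FORMATS = {"mp3", "wav", "opus"}


def build_tts_request(api_key, text, voice="Ara", audio_format="mp3"):
    """Build (but do not send) a TTS request.

    ASSUMPTION: the JSON keys "text", "voice", and "format" are
    placeholders; the real field names and value casing may differ.
    """
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    if audio_format not in FORMATS:
        raise ValueError(f"unsupported format: {audio_format}")
    body = json.dumps({"text": text, "voice": voice, "format": audio_format})
    return urllib.request.Request(
        TTS_URL,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Placeholder key, for illustration only.
req = build_tts_request("YOUR_XAI_API_KEY", "Hello [pause] world.", voice="Eve")
print(req.full_url, json.loads(req.data))
```

Sending the request with `urllib.request.urlopen(req)` and writing the response bytes to a file would complete the round trip. Validating the voice and format locally catches typos before they cost an API call.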
How Grok 4.3 Compares to GPT-4o and Claude
Against the two dominant frontier models, Grok 4.3 has genuine strengths in some areas and real gaps in others.
Document generation: GPT-4o with Code Interpreter can produce downloadable files but requires users to carefully construct file-format instructions. Claude 3.7 Sonnet has no native downloadable file generation in its chat interface as of April 2026. Grok 4.3’s implementation is cleaner for business document workflows that need to output a finished file rather than formatted text.
Video understanding: GPT-4o and Gemini 2.5 Pro handle video natively. Claude currently does not accept video input. Grok 4.3 enters an already-competitive space here rather than a greenfield one.
Voice APIs: Neither OpenAI’s Whisper and TTS stack nor ElevenLabs matches Grok’s pricing. Deepgram and AssemblyAI are the strongest feature competitors for STT, but they lose on price. For high-volume voice applications, Grok enters the market as the cheapest credible option with competitive accuracy claims.
Persistent memory: Grok 4.3 still lacks persistent memory between sessions, a gap ChatGPT closed two years ago and Claude covers with Projects. At $300 per month, this is the hardest omission to defend. For any use case where users expect the assistant to remember prior conversations, it is a real limitation that no pricing advantage on voice APIs offsets.
Who Should Use Grok 4.3 Beta Right Now
Given the $300/month gate and the beta status, the practical audience for Grok 4.3 today is narrower than xAI likely intends:
- Voice application developers evaluating STT/TTS providers for production pipelines. The standalone APIs are accessible without a SuperGrok subscription, and the pricing difference justifies a benchmark run against your current provider.
- Agencies and consultants building document-generation workflows for clients where the ability to output a formatted PDF or slide deck from a conversation has direct workflow value.
- Teams processing meeting recordings who need speaker diarization, transcription, and summarization in one integrated system.
- Enterprises already using xAI APIs who want to consolidate voice infrastructure onto a single provider relationship.
For individual developers and smaller teams, the $300/month bar is prohibitive for testing. The practical entry point is the standalone voice APIs, which are separately priced and offer the highest ROI relative to current alternatives.
What to Watch For in the Full Rollout
The full Grok 4.3 rollout to lower-tier subscribers is expected in mid-to-late May 2026. Several things are worth watching as that happens: independent STT benchmark validation against the xAI-reported figures, whether persistent memory gets added before or alongside the wider rollout, and how document generation holds up for complex multi-section formats at production scale.
For the voice APIs specifically, the most useful data point will be independent accuracy testing on domain-specific vocabulary — medical, legal, and technical terminology are where STT systems diverge most from headline benchmark figures, and those are exactly the use cases that drive enterprise voice API adoption.
Grok 4.3 Beta is the most consequential xAI model update since Grok 4 launched. The voice API pricing alone restructures the economics of voice AI at scale, and document generation closes a practical gap that competing platforms haven’t addressed as cleanly. The $300/month gate is a real barrier for most developers right now — but the standalone voice APIs are available today and worth benchmarking against your current stack before the full rollout changes the conversation entirely.