Mira Murati's TML-Interaction-Small hits 0.4s latency — 3x faster than GPT-Realtime-2.0. Full architecture breakdown and what it means for voice agent developers.
GPT-Realtime-2.0 averages 1.18 seconds from the end of your utterance to first audio token. Gemini 3.1 Flash Live clocks 0.57 seconds. Thinking Machines Lab’s first public model — TML-Interaction-Small — hits 0.40 seconds. But the latency number is almost the wrong frame for what Thinking Machines actually shipped.
The architecture Mira Murati’s lab announced on May 11, 2026, is not a faster realtime API. It is a fundamentally different model class. An interaction model does not wait for you to stop talking before it processes what you said. It listens and reasons simultaneously. It can interrupt, redirect, and respond mid-sentence — the same way a human conversation partner does. Every major voice AI system today is half-duplex masquerading as conversation. Thinking Machines shipped something that is architecturally full-duplex.
Here is what this means for voice and multi-modal agent architecture, what FD-bench actually measures, and what developers can — and cannot — build with it right now.
What “Interaction Models” Actually Are
The canonical request-response loop that underlies every AI assistant — user speaks, AI listens, AI processes, AI responds — is a carry-over from text interfaces. Text is naturally turn-based. Audio is not. Human conversation is full-duplex: both parties process incoming signals while simultaneously preparing responses.
The current generation of voice AI works around this by stitching together a voice activity detector (VAD), a transcription model, an LLM, and a TTS pipeline. The stitching introduces latency at every seam. More importantly, the end-to-end system fundamentally cannot respond until your turn ends. The VAD has to detect silence, the transcription model has to convert audio to tokens, and only then does the LLM start processing.
Thinking Machines Lab’s position is that this stitched architecture is not an implementation problem that can be optimized away. The constraint is structural. A model trained on discrete text tokens cannot natively reason over streaming audio — it has to wait for the stream to be transcribed first, and that transcription step is where latency accumulates regardless of how fast each component runs.
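To make the seams concrete, here is a rough per-turn latency budget for a stitched pipeline. The stage names and numbers are illustrative assumptions, not measured figures from any specific vendor; the point is that the stages serialize, so their costs add.
# Illustrative per-turn latency budget for a stitched voice pipeline (assumed numbers).
# Each stage blocks on the one before it, so the floor on response latency is the sum.

PIPELINE_BUDGET_MS = {
    "vad_silence_detection": 300,   # VAD must observe silence before declaring turn end
    "transcription_finalize": 180,  # ASR finalizes the last audio segment
    "llm_first_token": 350,         # LLM cannot start until the transcript exists
    "tts_first_audio": 150,         # TTS cannot start until tokens exist
    "network_overhead": 80,         # hops between the stitched services
}

print(f"Per-turn latency floor: {sum(PIPELINE_BUDGET_MS.values())} ms")  # 1060 ms here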
Their interaction model is trained to reason over raw streaming audio and video in real time. No transcription preprocessing step. The output is also generated as a native audio stream, not synthesized from text tokens after the fact. Both input and output stay in the continuous-signal domain throughout processing. The model’s context is a rolling temporal window rather than a fixed token buffer.
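A rolling temporal window is easiest to picture as a fixed-duration ring buffer over audio frames rather than a growing token sequence. The sketch below is a minimal illustration of that buffering discipline under assumed frame and window sizes, not the model's actual context mechanism.
# Minimal sketch of a rolling temporal context over audio frames (assumed sizes).

from collections import deque

FRAME_MS = 20                                   # one audio frame every 20 ms
WINDOW_SECONDS = 30                             # assumed rolling horizon
MAX_FRAMES = WINDOW_SECONDS * 1000 // FRAME_MS  # 1,500 frames in the window

context = deque(maxlen=MAX_FRAMES)              # oldest frames fall off automatically

def on_audio_frame(frame: bytes) -> None:
    """Append the newest frame; the window slides forward with wall-clock time."""
    context.append(frame)
    # A native interaction model would consume `context` continuously here,
    # rather than waiting for a turn boundary to materialize a token sequence.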
The Dual-Component Architecture
Thinking Machines describes the system as two coupled components that share full conversation context throughout a session. Understanding how these two components divide responsibility is the key to understanding why the architecture works.
TML-Interaction-Small (foreground model): A 276-billion-parameter Mixture-of-Experts architecture with 12 billion parameters active at each inference step. This is the component that maintains presence — it processes incoming audio and video continuously, tracks conversational state, handles turn-taking and interruption detection, and generates the immediate response stream. At 12B active parameters, it runs fast enough to sustain the 200ms micro-turn latency that enables genuine overlap between listening and speaking.
Background Model (asynchronous reasoning): A second, larger model that handles sustained reasoning, tool use, web search, and longer-horizon tasks. When a query requires genuine deliberation — a calculation, a code generation request, looking up current information — the foreground model hands off context to the background model, which works asynchronously. The foreground model maintains the conversational thread (acknowledging, asking clarifying questions, providing partial responses) while the background model works on the harder task.
# Conceptual architecture (from Thinking Machines Lab blog post)
# Source: thinkingmachines.ai/blog/interaction-models/
User audio/video stream
          │
          ▼
TML-Interaction-Small (12B active, MoE)
          │
          ├── Turn-taking, interruption, immediate response (200ms micro-turns)
          │
          └── Routes complex requests ──────────────────▶ Background Model
                                                               │
                                                               └── Tool calls, search,
                                                                   code gen, reasoning
                                                                   (async, full context)
This separation is the architectural insight. Previous realtime APIs tried to compress both presence and reasoning into a single model under a single latency budget. The result is a forced tradeoff: either the model is fast (shallow reasoning) or thorough (slow to respond). Thinking Machines decoupled the two concerns entirely. The foreground model is optimized for presence and latency. The background model is optimized for reasoning quality. They share context but operate on separate constraints.
The analogy that clarifies this: the foreground model is a skilled conversationalist who keeps the room engaged while a colleague does the research. The colleague’s quality of research is independent of how fast they deliver the answer to the room.
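The decoupling is easiest to see as two concurrent paths that share conversational state. The sketch below is a minimal asyncio illustration of that shape; the stand-in functions and the delegation trigger are assumptions for illustration, not the Thinking Machines API.
# Sketch of the foreground/background split as two concurrent paths (assumed design).
# Stub functions stand in for the real models; the delegation trigger is deliberately naive.

import asyncio

async def answer_quickly(utterance: str) -> str:
    # Stand-in for the foreground model's immediate response path.
    return f"(immediate reply to: {utterance})"

async def reason_slowly(utterance: str) -> str:
    # Stand-in for the background model: tools, search, longer-horizon reasoning.
    await asyncio.sleep(2.0)
    return f"(researched answer to: {utterance})"

async def converse(utterances: list[str]) -> None:
    pending: set = set()
    for utterance in utterances:
        if "look up" in utterance or "calculate" in utterance:
            # Delegate without blocking; the foreground stays present.
            pending.add(asyncio.create_task(reason_slowly(utterance)))
            print("foreground: on it, give me a moment while we keep talking.")
        else:
            print("foreground:", await answer_quickly(utterance))
        # Surface any background results that finished since the last turn.
        done = {t for t in pending if t.done()}
        for task in done:
            print("foreground (relaying background):", task.result())
        pending -= done
    # Flush anything still in flight before the session ends.
    for task in pending:
        print("foreground (relaying background):", await task)

asyncio.run(converse(["hi there", "look up the latency numbers", "thanks, anything else?"]))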
The Benchmarks — What FD-bench Actually Measures
Thinking Machines published results on FD-bench, a benchmark they developed to measure full-duplex interaction quality. Unlike standard voice benchmarks that measure latency from utterance-end to response-start, FD-bench measures interaction quality across five dimensions: turn-taking latency, interruption handling, overlap tolerance, context retention across interruptions, and response coherence under streaming conditions. The multi-dimensional approach matters because a model can fake low latency by starting to speak before it has processed the input — FD-bench measures whether the response is actually coherent and contextually appropriate, not just fast.
On the latency dimension of FD-bench, TML-Interaction-Small achieved 0.40 seconds. GPT-Realtime-2.0 clocked 1.18 seconds. Gemini 3.1 Flash Live hit 0.57 seconds. The 2.95x gap between TML and GPT-Realtime is not a marginal improvement. At 0.40 seconds, the response begins before a human listener registers a conversational pause. The interaction feels synchronous rather than sequential.
The caveat worth stating clearly: FD-bench is a benchmark Thinking Machines designed and ran themselves. No independent replication exists yet. The methodology is published in their interaction-models blog post, but until Hugging Face, Stanford HELM, or LMSYS runs the same evaluation on production systems, these numbers should be treated as directionally compelling rather than definitively proven. What is structurally verifiable from the technical architecture description is that the system handles simultaneous audio output and input — it can begin responding while you are still speaking — which no current production API does at the architectural level.
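For teams that want to track the same axes against their own stack while waiting for that replication, the five dimensions map cleanly onto a per-session record. The field names below paraphrase the published dimensions; the 0-to-1 scale for the quality axes is an assumption, not the published methodology.
# The five FD-bench dimensions as a per-session record (field names paraphrased,
# 0-to-1 quality scale assumed, not taken from the published methodology).

from dataclasses import dataclass

@dataclass
class FullDuplexScore:
    turn_taking_latency_s: float    # end of user speech to first audio out
    interruption_handling: float    # does the model yield and recover cleanly?
    overlap_tolerance: float        # coherence while both parties speak at once
    context_retention: float        # does earlier context survive interruptions?
    streaming_coherence: float      # is a response started early still correct?

    def quality_mean(self) -> float:
        """Average of the four quality axes; latency is reported separately."""
        return (self.interruption_handling + self.overlap_tolerance
                + self.context_retention + self.streaming_coherence) / 4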
The practical implications for developers building voice infrastructure: the difference between 1.18s and 0.4s matters most in conversational loops where the delay is perceptible on every exchange. A 10-turn conversation with 1.18s per turn accumulates 11.8 seconds of dead air across the session. At 0.4s, that becomes 4 seconds. For voice-first applications — customer support, coaching tools, companion interfaces, voice agents — this is the difference between a conversation that feels natural and one that constantly reminds users they are talking to a machine.
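The same arithmetic extends to any session length if you want to estimate how much waiting your own users accumulate; the snippet below simply multiplies the published per-turn latencies out.
# Dead air accumulated per session at each published per-turn latency (simple arithmetic).

def dead_air_seconds(turns: int, per_turn_latency_s: float) -> float:
    return turns * per_turn_latency_s

for label, latency in [("GPT-Realtime-2.0", 1.18),
                       ("Gemini 3.1 Flash Live", 0.57),
                       ("TML-Interaction-Small", 0.40)]:
    print(f"{label}: {dead_air_seconds(10, latency):.1f} s over a 10-turn session")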
Mira Murati’s Strategic Position
The founding context around Thinking Machines Lab matters for understanding why this architecture was the first thing they chose to announce. Mira Murati left OpenAI in September 2024 after six years at the company, the last two-plus as CTO, a tenure during which GPT-4, DALL-E 3, Whisper, and Sora shipped. Thinking Machines raised $2 billion before publishing a single model. The first model they chose to demonstrate publicly is not a reasoning model, a coding model, or a benchmark-topping general intelligence claim — it is a new interaction architecture.
That choice signals a deliberate strategic bet. Every major lab is competing on reasoning benchmarks right now: AIME, HLE, SWE-bench, GPQA Diamond. The frontier on pure reasoning quality is highly contested and expensive to win. The interaction layer — the question of how a model is present in a conversation, not just what it knows — has received comparatively little architectural innovation. Realtime APIs from OpenAI and Google are essentially optimized stitched pipelines. Thinking Machines built from different first principles.
Whether TML-Interaction-Small eventually pairs with a frontier reasoning model determines whether this becomes a meaningful platform shift. The foreground model’s capabilities are detailed and benchmarked. The background model’s reasoning quality — its performance on coding, math, tool use, and extended instruction following — is not yet published. A strong interaction layer on top of a mediocre reasoning layer is a better conversation interface wrapped around weaker answers. The other half of the story is still forthcoming.
What Developers Can Actually Build Right Now
Access to TML-Interaction-Small is gated. The research preview is limited to approved applicants, and the general API is not open. If you are building a voice application today and need production-ready infrastructure, the current options remain the established APIs. See our complete OpenAI GPT Realtime 2.0 developer guide for current production capabilities and pricing, and our multi-model routing guide for how to structure decision logic when the best model for each task changes.
Regardless of current API access, developers should do two things right now:
1. Request early access at thinkingmachines.ai. The research preview exists to collect data on how the architecture performs across different conversation types and domains. Your production use case is exactly the feedback they want. Even if access takes weeks or months, being in the queue matters when wider availability opens.
2. Audit your current voice architecture for stitching overhead. If you are running a VAD + transcription + LLM + TTS pipeline, the transition to a native interaction model will require architectural changes, not just an API key swap. The integration pattern for TML will differ from current realtime APIs — the dual-component design means routing logic between foreground and background tasks needs explicit handling.
# Anticipated integration pattern (conceptual — actual SDK not yet released)
# Based on published architecture at thinkingmachines.ai/blog/interaction-models/
# Today (stitched pipeline)
audio_in → vad.detect_turn_end()
         → transcriber.transcribe(audio_segment)
         → llm.complete(transcript)
         → tts.synthesize(completion)
         → audio_out
# TML pattern (native streaming — when API releases)
# Foreground handles presence, background handles reasoning
session.stream(
    audio_in,                                    # continuous, no turn detection needed
    on_complex_query=background_model.delegate,  # async reasoning
) → audio_out                                    # simultaneous with input
The shift is meaningful for engineering teams. The VAD logic, transcription error handling, and TTS queue management that currently dominate voice application complexity largely disappear. What replaces them is the orchestration logic between foreground and background components — when to delegate, how to surface partial results from the background model through the foreground conversation, and how to handle background model failures gracefully.
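What that orchestration looks like can be sketched today, even without the SDK: delegate, keep the conversation alive, and make sure a background failure degrades into something the foreground can still say. Everything below (the timeout, the fallback wording, the function names) is an assumption about shape, not the actual TML interface.
# Sketch of the delegation layer with graceful failure handling (assumed design;
# the real SDK is not yet released).

import asyncio
from typing import Awaitable, Callable

BACKGROUND_TIMEOUT_S = 20.0   # assumed upper bound before the foreground changes course

async def delegate_with_fallback(
    task_description: str,
    background_call: Callable[[str], Awaitable[str]],
) -> str:
    """Run the slow path, but always return something the foreground can say."""
    try:
        return await asyncio.wait_for(
            background_call(task_description), timeout=BACKGROUND_TIMEOUT_S
        )
    except asyncio.TimeoutError:
        return "That's taking longer than expected. Want me to keep digging, or move on?"
    except Exception:
        return "I couldn't finish that lookup, so here's what I know without it."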
What to Watch in the Next 90 Days
Three signals will determine whether the Thinking Machines architecture becomes a genuine platform shift or an impressive research result that does not reach production at scale.
Background model quality disclosure. The foreground model’s performance is detailed and benchmarked. The background model’s reasoning capabilities have not been published. A foreground model that routes to a mediocre background is a better conversation wrapper around weaker answers. Thinking Machines will need to demonstrate reasoning quality on established benchmarks to position TML as a complete replacement for current voice stacks rather than a faster front-end on top of an existing model.
Independent FD-bench replication. The benchmark methodology is published. When major evaluation organizations run the same tests on production systems, we will have a cleaner picture of where TML-Interaction-Small actually sits in the latency and quality landscape. Self-reported benchmarks on self-designed metrics are a starting point, not a verdict. Watch for LMSYS or Hugging Face coverage in the coming months.
API availability timeline. Thinking Machines has stated “later in 2026” for broader access. Given the latency advantages on FD-bench, a broadly available API in the second half of 2026 would land at exactly the moment voice-first agent applications are actively seeking lower-latency infrastructure. The timing is good. Whether the API is production-grade and competitively priced at launch is the open question that determines adoption speed.
The interaction layer of AI is significantly less saturated with competition than the reasoning layer. If Thinking Machines can pair a strong foreground interaction model with a competitive background reasoning model and bring a reliable API to market, they have a real shot at owning the interaction-native application category — the same way Whisper briefly owned high-quality open-source transcription before the realtime APIs arrived.
For developers making infrastructure decisions for voice or multi-modal agents over the next six months: add Thinking Machines to your evaluation list, request research preview access, and monitor thinkingmachines.ai/docs for API announcements. The first production-quality full-duplex interaction API will establish the pattern that voice-first agent development follows for the next several years.
Every voice architecture template, multi-modal agent starter kit, and API integration guide for the current generation of voice APIs is available at wowhow.cloud — built for production, priced once.
Written by
Anup Karanjkar
Expert contributor at WOWHOW. Writing about AI, development, automation, and building products that ship.