Google's Gemini 3.1 Ultra, released in late March 2026, ships with a 2 million token context window, the largest ever available in a production API. That is not a benchmark number: you can use all 2 million tokens today via the Gemini API, Google AI Studio, and Vertex AI. For developers building agentic systems, document intelligence pipelines, and long-horizon reasoning tasks, this represents a genuine architectural shift in what is possible in a single model call. This guide breaks down what 2M tokens actually looks like, which use cases justify it, the cost math, and how to use context caching to avoid a surprise bill when you start experimenting.
What 2 Million Tokens Actually Looks Like
Token counts are abstract until you map them to the content you work with every day. Based on our testing with Gemini 3.1 Ultra's tokenizer, here is what 2 million tokens translates to in practice:
- Text: Approximately 1.4 million words — equivalent to about 2,800 pages of standard prose, or 10 to 14 full-length novels.
- Code: Roughly 100,000 to 150,000 lines of commented source code, depending on language density. A large Next.js or Django monorepo fits comfortably in a single context.
- Audio transcripts: Around 140 hours of transcribed speech at average speaking pace.
- PDFs: Approximately 180 to 200 dense research papers, assuming a typical 7,000 to 8,000 word paper.
- Conversation history: A 200-turn agent conversation with tool call payloads included — the full session, not a truncated summary.
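These conversions come from a simple characters-per-token heuristic. A minimal sketch, assuming roughly 4 characters per token and 500 words per page — ballpark figures, not Gemini's actual tokenizer; use the API's token-counting endpoint for exact counts:

```python
# Back-of-envelope token estimator. The 4-chars-per-token ratio is a
# common rule of thumb for English text, not an exact tokenizer model.
CHARS_PER_TOKEN = 4
WORDS_PER_PAGE = 500

def estimate_tokens(text: str) -> int:
    """Rough token count from character length."""
    return len(text) // CHARS_PER_TOKEN

def pages_for_budget(token_budget: int, chars_per_word: float = 5.7) -> int:
    """Approximate pages of standard prose that fit in a token budget."""
    words = token_budget * CHARS_PER_TOKEN / chars_per_word
    return int(words / WORDS_PER_PAGE)

print(pages_for_budget(2_000_000))  # roughly 2,800 pages
```

For real workloads, count tokens with the API before you commit to a context layout; heuristics drift badly on code and non-English text.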
For comparison: GPT-5.4's standard context window is 128K tokens (about 96,000 words), and Claude Opus 4.6 tops out at 200K tokens (around 150,000 words). Gemini 3.1 Ultra's 2M window is roughly 15x the size of GPT-5.4's standard offering and 10x Claude Opus 4.6's maximum. According to our analysis of the three leading frontier models, Gemini 3.1 Ultra is the only model today where the practical ceiling is your content size, not the context limit.
Why This Matters for Agentic Development
Before 2M context windows, every long-context application required one of three workarounds: chunking documents and running multiple inference calls, building a retrieval-augmented generation (RAG) pipeline to retrieve relevant sections, or summarizing intermediate results and losing information in the compression. Each approach adds latency, complexity, and failure modes. Large context eliminates the need for these workarounds in a significant class of problems.
The most impactful change for agent developers is the elimination of state summarization. Agentic systems that run over many hours — filing expenses, researching and drafting reports, operating a computer to complete a workflow — accumulate context rapidly. With 128K or 200K limits, agents need to compress their working memory periodically, and this compression introduces errors. The model loses track of decisions it made earlier, contradicts itself, or fails to notice that a constraint set in turn 5 is violated by an action in turn 87. A 2M window is large enough that most practical multi-hour agent tasks complete before the context ceiling is hit, making the agent more reliable without any change to the underlying prompt engineering.
Five Developer Use Cases That Were Impractical Before
1. Full Codebase Review and Automated Refactoring
A mid-size SaaS codebase — say, 80,000 to 120,000 lines across 400 files — fits in a single Gemini 3.1 Ultra context. You can load the entire codebase and ask for a security audit, identify all places a deprecated API is used, propose a refactoring plan, or generate a migration guide for a dependency upgrade — all in a single call with full cross-file awareness. Previously this required either a purpose-built code indexing system (like the ones GitHub Copilot or Claude Code use internally) or breaking the codebase into chunks and losing the cross-file references that are most valuable for refactoring.
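Getting a codebase into a single context is mostly a packing problem. A minimal repo-dump sketch: the extension filter, excluded directories, `FILE` header format, and ~8 MB cap (about 2M tokens at 4 characters per token) are all illustrative choices, not requirements:

```python
import os

# Concatenate a repository into one text blob for long-context loading.
# Per-file path headers let the model cite file locations in its answers.
SOURCE_EXTENSIONS = {".py", ".ts", ".tsx", ".sql", ".md"}

def dump_repo(root: str, max_bytes: int = 8_000_000) -> str:
    parts, total = [], 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip vendored and VCS directories in place
        dirnames[:] = [d for d in dirnames if d not in {".git", "node_modules"}]
        for name in sorted(filenames):
            if os.path.splitext(name)[1] not in SOURCE_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                body = f.read()
            total += len(body)
            if total > max_bytes:  # stay under ~2M tokens
                return "\n".join(parts)
            parts.append(f"--- FILE: {os.path.relpath(path, root)} ---\n{body}")
    return "\n".join(parts)
```

Writing the dump to a file (as the caching example later in this guide assumes) makes the context reusable across runs.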
2. Legal and Contract Intelligence
A complete enterprise software contract, including master service agreement, data processing addendum, statement of work, and exhibits, commonly runs 200 to 350 pages. Legal teams reviewing vendor contracts for compliance with internal standards — data residency requirements, liability caps, indemnification clauses — can load the full contract stack into a single context and query it comprehensively. Law firms that have piloted this workflow report that the model's cross-document reasoning (for example, catching a clause in an exhibit that contradicts the liability cap in the MSA) is substantially better when the full context is available versus when documents are processed separately.
3. Research Synthesis Across Large Literature Pools
Academics and analysts routinely work with 50 to 100 papers on a topic. Loading 60 average-length research papers into a single 2M context call allows the model to identify contradictions across studies, trace how a methodology evolved over time, flag papers with statistical anomalies relative to their claimed results, and synthesize a literature review that covers the full corpus. RAG-based literature tools select a subset of papers per query and lose the comparative signal that comes from reading the whole collection in parallel.
4. Enterprise Log and Telemetry Analysis
Production incident analysis often requires correlating logs from multiple services over a multi-hour window. A 4-hour span of application logs, database query logs, and infrastructure metrics for a medium-size service can easily fit within 2M tokens. Loading the full log window allows the model to trace cascading failures, identify the root cause event in context, and generate a post-incident report — without needing a separate log aggregation pipeline or a specialized observability tool that chunks the logs for you.
5. Full-Session Agent Memory Without Summarization
For developers building personal assistant agents or long-running automation agents, the 2M context window means you can include full conversation history from a user's entire session without lossy compression. A user who spends 3 hours using an AI research assistant accumulates far less than 2M tokens of conversation — meaning the agent retains every preference, decision, and piece of context the user has shared, without needing an external memory store for most practical sessions.
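One way to apply this in an agent loop: track the session's token budget and fall back to summarization only when the window is nearly spent. A sketch using this article's 2M limit and a 4-characters-per-token estimate; the reserve size is an illustrative assumption:

```python
# Session-memory guard: keep full history until the context budget is
# nearly exhausted, then (and only then) trigger lossy compression.
CONTEXT_LIMIT = 2_000_000  # Gemini 3.1 Ultra's window, in tokens

def fits_in_context(history: list[str], reserve: int = 100_000) -> bool:
    """True if the full history, plus a reserve for the reply, still fits."""
    used = sum(len(msg) // 4 for msg in history)  # ~4 chars/token heuristic
    return used + reserve <= CONTEXT_LIMIT
```

For most practical sessions the guard never fires, which is exactly the point: summarization becomes an edge-case path instead of a per-turn tax.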
Cost Breakdown: Avoiding Bill Shock
The capability is impressive. The cost requires careful planning. Gemini 3.1 Ultra's pricing at launch is approximately $3.50 per million input tokens and $10.50 per million output tokens on the standard tier via the Gemini API. A naive full 2M-token input call costs around $7 per request before output tokens. Run this 100 times a day and you are at $700/day — $21,000/month — before you have written a single line of business logic on top of it.
This does not mean 2M context is prohibitively expensive. It means you need to design for it. The practical cost management strategy has two components:
Context Caching: The 75% Discount
Google's context caching feature allows you to cache a fixed portion of the prompt (the "system context") and pay a dramatically reduced rate for cached tokens on subsequent calls. Cached tokens cost approximately $0.875 per million — a 75% reduction from the full input rate. The cache TTL is configurable from 5 minutes to 24 hours.
The design pattern this enables: load your large static context (codebase, document set, long system prompt) once, cache it, then issue many short queries against it. With the full 2M-token context cached, your effective per-call input cost drops from $7.00 to roughly $1.75 (plus pennies for each new query), transforming an enterprise-only use case into something viable at startup scale.
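The arithmetic is worth encoding once so you can sanity-check a workload before running it. A sketch using the launch prices quoted above; verify against the current rate card before budgeting:

```python
# Caching cost math at the launch prices cited in this article:
# $3.50/M standard input, $0.875/M cached input. Output tokens excluded.
INPUT_PER_M = 3.50
CACHED_PER_M = 0.875

def call_cost(context_tokens: int, query_tokens: int, cached: bool) -> float:
    """Input cost of one call, in dollars."""
    rate = CACHED_PER_M if cached else INPUT_PER_M
    return (context_tokens * rate + query_tokens * INPUT_PER_M) / 1_000_000

uncached = call_cost(2_000_000, 200, cached=False)  # ~$7.00
cached = call_cost(2_000_000, 200, cached=True)     # ~$1.75
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}")
```

Multiply by your daily call volume before you ship: the difference between $7.00 and $1.75 per call compounds quickly at scale.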
Right-Sizing Your Context
Not every query needs 2M tokens. Use the full window for tasks where cross-document reasoning matters — audit passes, synthesis queries, full-history agent calls. Use Gemini 3.1 Flash Lite (at a fraction of the cost) for high-frequency, focused queries against well-scoped documents. The practical pattern for a production document intelligence system: index and retrieve with a cheap model, then load the retrieved set plus global context into Gemini 3.1 Ultra for final synthesis.
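One way to express this routing is a small dispatch function. The model names follow the tiers discussed in this guide; the task labels and token threshold are illustrative assumptions, not an official scheme:

```python
# Tiered model routing: cheap model for high-volume focused work,
# long-context model only where cross-document reasoning pays for itself.
def pick_model(task: str, context_tokens: int) -> str:
    if task in {"classify", "extract", "retrieve"}:
        return "gemini-3.1-flash-lite"   # high-frequency, well-scoped
    if context_tokens > 1_000_000 or task == "synthesize":
        return "gemini-3.1-ultra"        # full-corpus reasoning
    return "gemini-3.1-pro"              # default long-context tier
```

The retrieval step and the synthesis step can then share one pipeline while hitting very different price points per call.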
Gemini API: Getting Started With Long Context
Here is a minimal Python example using context caching with the Gemini API to load a large codebase once and query it repeatedly:
```python
import google.generativeai as genai
from google.generativeai import caching
import datetime

genai.configure(api_key="YOUR_API_KEY")

# Load your large static context (e.g., entire codebase as text)
with open("codebase_dump.txt", "r") as f:
    codebase = f.read()

# Create a cached context with a 1-hour TTL
cache = caching.CachedContent.create(
    model="gemini-3.1-ultra",
    display_name="my-codebase-cache",
    system_instruction=(
        "You are an expert code reviewer. Analyze code for security "
        "issues, performance problems, and anti-patterns."
    ),
    contents=[codebase],
    ttl=datetime.timedelta(hours=1),
)

# Create a model instance that uses the cached context
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# Query repeatedly against the cached context — each call is cheap
response = model.generate_content(
    "List all places where user input is passed to a SQL query "
    "without parameterization."
)
print(response.text)

# Clean up when done
cache.delete()
```
Each subsequent query against the same cached context costs only the new input tokens (your question) plus the cached-rate tokens for the stored context, dramatically cheaper than reloading the full context on every call. For a batch audit task, such as running 50 different security checks against the same codebase, caching turns what would be a $350 job into one closer to $95: 50 cached calls at roughly $1.75 each, plus the one-time cost of populating the cache.
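The batch-audit pattern reduces to a loop over one cached context. A minimal sketch: `ask` stands in for any callable that sends a prompt (for example, a wrapper around `model.generate_content` from the example above), and the checks themselves are illustrative:

```python
# Batch audit: many questions, one cached context. Each check reuses the
# cached codebase, so only the short prompt is billed at the full rate.
SECURITY_CHECKS = [
    "List SQL queries built from unparameterized user input.",
    "List endpoints missing authentication middleware.",
    "List secrets or API keys committed in source.",
]

def run_audit(ask, checks):
    """Return {check: answer}, issuing one call per check."""
    return {check: ask(check) for check in checks}
```

Because the cache TTL is configurable, a nightly audit can populate the cache once and amortize it across the entire check suite.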
Context Window Comparison: The Current Frontier
Based on our analysis of the leading frontier models available via API as of April 2026:
| Model | Max Context | Input Cost (1M tokens) | Context Caching | Best For |
|---|---|---|---|---|
| Gemini 3.1 Ultra | 2M tokens | $3.50 | Yes — 75% discount | Full-corpus reasoning, large codebases, long-session agents |
| Gemini 3.1 Pro | 2M tokens | $2.50 | Yes — 75% discount | Cost-effective long context for most enterprise tasks |
| GPT-5.4 | 128K (1M enterprise) | $2.50 | Yes — 50% discount | Agentic workflows, computer use, coding |
| Claude Opus 4.6 | 200K | $15 | Yes — up to 90% discount | Code generation, precise instruction following |
| Gemini 3.1 Flash Lite | 1M tokens | $0.075 | Yes | High-volume, cost-sensitive tasks |
Gemini 3.1 Pro's 2M window at $2.50/M input is the pragmatic choice for most teams: you get the same context ceiling as Ultra at a lower cost, with Ultra reserved for tasks requiring the highest reasoning quality. Note that GPT-5.5 ("Spud"), expected to ship in late April 2026, is rumored to extend OpenAI's standard context to 1M tokens natively — which would close the gap significantly. Until then, Gemini holds a clear lead on context length.
When NOT to Use the 2M Context Window
Large context is a tool, not a default setting. There are clear cases where it is the wrong choice:
- Simple, focused queries: If you are asking a factual question, generating a short email, or classifying a document, a 2M context window adds latency and cost with no benefit. Use a fast, small model.
- High-frequency, low-context pipelines: If you are running 10,000 classification calls per day with a 500-word input, Gemini 3.1 Flash Lite at $0.075/M tokens is roughly 47x cheaper than Ultra, with little quality loss on tasks that focused.
- When retrieval-augmented generation is already working: If you have a mature RAG system with high retrieval precision, rebuilding it around large context may deliver marginal gains while adding operational complexity. The 2M window shines most when the retrieval step itself is the bottleneck — when the right answer requires reasoning across documents that a similarity search would not co-locate.
- Latency-sensitive applications: Processing 2M tokens takes time. For real-time user-facing applications where the response-time budget is under 2 seconds, full-window calls add unacceptable latency. Use streaming and right-size your context to what the query actually needs.
What Comes Next in the Context Window Race
The 2M token context window is a significant milestone, but the arms race is not over. GPT-5.5 (Spud) is expected to push OpenAI's standard context to at least 512K tokens, with some credible leaks suggesting 1M. Anthropic's Claude roadmap hints at extending beyond 200K in the next major update. The competition is pushing every provider toward larger, faster, and cheaper context handling.
The more interesting development, however, is not raw context size — it is context quality. A model that reliably reasons across 2M tokens without "losing" facts from the middle of the context (a known failure mode called "lost in the middle") is more valuable than a model with a 4M window that degrades in quality past 500K. Google's published evaluations show Gemini 3.1 Ultra maintaining consistent accuracy across the full 2M window, but independent third-party evaluations of large-context reliability are still emerging. According to our testing, the model handles well-structured large contexts (clearly organized sections, consistent formatting) significantly more reliably than dense, unstructured text at scale.
For developers planning their AI infrastructure today, the practical recommendation is to build your data pipelines and agent architectures to support variable context lengths. The models that handle long context well will only get cheaper over time — but the teams that designed their systems to take advantage of it from the start will compound that benefit as pricing drops.
The Bottom Line
Gemini 3.1 Ultra's 2 million token context window is the most practically significant context expansion since the shift from 4K to 32K tokens. It makes a real class of problems tractable that were not before: full-codebase reasoning, multi-document legal analysis, complete agent session memory, and large-corpus research synthesis. The cost, at $3.50 per million input tokens, is manageable when you design for context caching — the pattern of loading a large static context once and querying it many times reduces effective costs by 75%. The main mistake to avoid is treating it as a default: use large context for the tasks that genuinely need it, and right-size everything else.
Want to see how Gemini 3.1 Ultra compares across the full range of benchmarks? Read our April 2026 benchmark deep dive comparing GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6. Or explore our guide to Gemini 3.1 Flash Lite for the opposite end of the cost-performance spectrum.