Pricing Breakdown
Gemini 3.5 Flash pricing is structured as follows:
- Input tokens: $1.50 per million (standard regions); $1.65/M non-global
- Output tokens: $9.00 per million (standard); $9.90/M non-global
- Cached input tokens: $0.15 per million — 90% cheaper than uncached input
- Thinking tokens: Billed at the output rate when thinking_level is set; thought preservation adds tokens across multi-turn conversations
To compare directly: Gemini 3.1 Pro was priced at $2.00 input / $12.00 output per million tokens. Switching to 3.5 Flash cuts both input and output costs by 25%, while delivering better benchmark scores. For applications using context caching (storing system prompts, reference documents, or shared instructions), the $0.15/M cached input rate prices repeated-context reads at one-tenth of the uncached input rate.
A concrete cost example: a production application making 10,000 API calls per day, averaging 2,000 input tokens and 500 output tokens per call, spends roughly $30/day on input and $45/day on output using 3.5 Flash. The same workload on 3.1 Pro costs $40/day input and $60/day output. That is a difference of $25/day, or approximately $750 per month, before any thinking tokens are counted.
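The arithmetic behind this example can be checked with a small calculator. This is a minimal sketch: the rates are the ones quoted in this article, while the `Workload` shape, the `dailyCost` helper, and the `cachedFraction` field are hypothetical names introduced here for illustration. Thinking tokens are not modelled.

```typescript
// Rates in USD per million tokens, as quoted in this article.
interface Rates {
  inputPerM: number;
  outputPerM: number;
  cachedInputPerM?: number; // only used when cachedFraction > 0
}

// Hypothetical workload description for illustration.
interface Workload {
  callsPerDay: number;
  inputTokensPerCall: number;
  outputTokensPerCall: number;
  cachedFraction?: number; // assumed share of input tokens read from cache (0..1)
}

const FLASH_35: Rates = { inputPerM: 1.5, outputPerM: 9.0, cachedInputPerM: 0.15 };
const PRO_31: Rates = { inputPerM: 2.0, outputPerM: 12.0 };

function dailyCost(w: Workload, r: Rates): { input: number; output: number; total: number } {
  const inputM = (w.callsPerDay * w.inputTokensPerCall) / 1_000_000;
  const outputM = (w.callsPerDay * w.outputTokensPerCall) / 1_000_000;
  const cached = w.cachedFraction ?? 0;
  // Fall back to the normal input rate when no cached rate is given.
  const cachedRate = r.cachedInputPerM ?? r.inputPerM;
  const input = inputM * (1 - cached) * r.inputPerM + inputM * cached * cachedRate;
  const output = outputM * r.outputPerM;
  return { input, output, total: input + output };
}

const workload: Workload = {
  callsPerDay: 10_000,
  inputTokensPerCall: 2_000,
  outputTokensPerCall: 500,
};

const flash = dailyCost(workload, FLASH_35);
const pro = dailyCost(workload, PRO_31);
const monthlySavings = (pro.total - flash.total) * 30;

console.log({ flash, pro, monthlySavings });
```

With the workload above this yields $75/day on 3.5 Flash and $100/day on 3.1 Pro, a difference of $25/day or $750 over a 30-day month. Setting `cachedFraction` shows how much of the input bill moves to the cached rate.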
Migration Guide: Three Breaking API Changes
Gemini 3.5 Flash introduces three API-level changes that affect existing code targeting Gemini 3.1 Pro or the gemini-3-flash-preview model. Each is documented below with the exact change required.
1. Model Name Update
The stable GA identifier replaces the preview suffix. Update your model string:
// Before: preview or 3.1 Pro
const model = genAI.getGenerativeModel({ model: "gemini-3-flash-preview" });
// After: Gemini 3.5 Flash stable GA
const model = genAI.getGenerativeModel({ model: "gemini-3.5-flash" });
2. Thinking Configuration: Enum Replaces Integer
The thinking_budget integer parameter has been replaced with a thinking_level string enum. The default changed from high to medium. Remove temperature, top_p, and top_k from your generation config — they are no longer recommended and are silently ignored in Flash 3.5.
// Before: 3.1 Pro thinking config
const result = await model.generateContent({
  contents: [{ role: "user", parts: [{ text: prompt }] }],
  generationConfig: {
    temperature: 0.7,
    top_p: 0.95,
    thinking_budget: 8192,
  },
});
// After: 3.5 Flash thinking config
const result = await model.generateContent({
  contents: [{ role: "user", parts: [{ text: prompt }] }],
  generationConfig: {
    thinking_level: "high", // "low" | "medium" | "high"
    maxOutputTokens: 65000,
  },
});
Important: Thought preservation is on by default in 3.5 Flash. In multi-turn conversations, the model carries thinking tokens forward across turns, which increases total billed token usage. Audit your multi-turn flows after migration and add explicit context truncation if costs rise unexpectedly on long sessions.
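For codebases with many generation configs, the change described in this section can be applied mechanically. The sketch below is one possible approach, not an official mapping: the field names come from the examples in this article, while `migrateConfig`, `budgetToLevel`, and the numeric cut-offs are assumptions chosen so that the article's 8192 budget maps to "high". Tune the thresholds against your own evaluations.

```typescript
type ThinkingLevel = "low" | "medium" | "high";

// Shape of a pre-migration config, following this article's "before" example.
interface LegacyConfig {
  temperature?: number;
  top_p?: number;
  top_k?: number;
  thinking_budget?: number;
  maxOutputTokens?: number;
}

interface MigratedConfig {
  thinking_level: ThinkingLevel;
  maxOutputTokens?: number;
}

// ASSUMPTION: these cut-offs are illustrative only, not published guidance.
function budgetToLevel(budget: number | undefined): ThinkingLevel {
  if (budget === undefined) return "medium"; // the new default stated in this article
  if (budget < 2048) return "low";
  if (budget < 8192) return "medium";
  return "high";
}

function migrateConfig(old: LegacyConfig): MigratedConfig {
  const migrated: MigratedConfig = {
    thinking_level: budgetToLevel(old.thinking_budget),
  };
  if (old.maxOutputTokens !== undefined) {
    migrated.maxOutputTokens = old.maxOutputTokens;
  }
  // temperature, top_p, and top_k are intentionally not copied over.
  return migrated;
}

console.log(migrateConfig({ temperature: 0.7, top_p: 0.95, thinking_budget: 8192 }));
```

Building a fresh object, rather than deleting keys from the old one, keeps the removed sampling parameters from leaking through if new legacy fields are added later.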
3. Function Calling: Add id and name to FunctionResponse
Function responses now require explicit id and name fields matching the originating function call from the model:
// Before: 3.1 Pro function response
const functionResponse = {
  role: "function",
  parts: [{
    functionResponse: { name: "get_weather", response: { temp: 22 } }
  }],
};
// After: 3.5 Flash function response (id and name required)
const functionResponse = {
  role: "function",
  parts: [{
    functionResponse: {
      id: functionCall.id, // must match the id from the model FunctionCall part
      name: "get_weather", // must match the function name exactly
      response: { temp: 22 },
    },
  }],
};
If you use multimodal function responses (returning images or structured data from a function call), move the multimodal content inside the functionResponse part, not alongside it as a separate part. If you append inline instructions to a function response, append them to the response text field separated by two newlines — not as a separate content part.
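One way to avoid mismatched fields is to build every function response from the originating call rather than typing the name by hand. The following is a sketch under stated assumptions: `FunctionCallLike`, `FunctionResponsePart`, and `buildFunctionResponsePart` are local, simplified definitions written for this example, and the real SDK types may differ.

```typescript
// Local, simplified types for illustration; not the SDK's own definitions.
interface FunctionCallLike {
  id?: string;
  name: string;
  args: Record<string, unknown>;
}

interface FunctionResponsePart {
  functionResponse: {
    id: string;
    name: string;
    response: Record<string, unknown>;
  };
}

// Copies id and name from the model's call so they always match.
function buildFunctionResponsePart(
  call: FunctionCallLike,
  response: Record<string, unknown>
): FunctionResponsePart {
  if (!call.id) {
    throw new Error(`Function call "${call.name}" has no id; cannot build a matching response`);
  }
  return { functionResponse: { id: call.id, name: call.name, response } };
}

const part = buildFunctionResponsePart(
  { id: "call-1", name: "get_weather", args: { city: "Oslo" } },
  { temp: 22 }
);
console.log(JSON.stringify(part));
```

Failing fast on a missing id surfaces the problem at the point where the response is built, instead of as an API error on the next request.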
TypeScript Code Examples
Basic Generation
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({
  model: "gemini-3.5-flash",
  generationConfig: {
    thinking_level: "medium",
    maxOutputTokens: 8192,
  },
});

const result = await model.generateContent(
  "Explain the trade-offs between RSA and ECC for TLS certificates in production."
);
console.log(result.response.text());
Multimodal Input: PDF Analysis
Flash 3.5 handles PDF, image, audio, and video natively in a single API call — matching the full multimodal input now available in the Google Search box:
import fs from "fs";
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-3.5-flash" });

// Read the PDF and encode it as base64 for inline upload
const pdfBuffer = fs.readFileSync("./contract.pdf");
const base64Pdf = pdfBuffer.toString("base64");

const result = await model.generateContent({
  contents: [{
    role: "user",
    parts: [
      {
        text: "List all termination clauses and auto-renewal provisions in this contract.",
      },
      {
        inlineData: {
          mimeType: "application/pdf",
          data: base64Pdf,
        },
      },
    ],
  }],
  generationConfig: { thinking_level: "high", maxOutputTokens: 4096 },
});
console.log(result.response.text());
Agentic Function Calling with id Matching
import {
  GoogleGenerativeAI,
  type FunctionDeclaration,
  type Tool,
} from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-3.5-flash" });

const tools: Tool[] = [{
  functionDeclarations: [{
    name: "search_database",
    description: "Search the internal product database.",
    parameters: {
      type: "object",
      properties: {
        query: { type: "string", description: "Search query" },
        limit: { type: "number", description: "Max results to return" },
      },
      required: ["query"],
    },
  } as FunctionDeclaration],
}];

const chat = model.startChat({ tools });
const response = await chat.sendMessage(
  "Find the top 5 products in the developer tools category."
);

// Look for a function call in the first candidate
const candidate = response.response.candidates?.[0];
const fnCallPart = candidate?.content.parts.find((p) => p.functionCall);

if (fnCallPart?.functionCall) {
  const { name, args, id } = fnCallPart.functionCall;
  // searchDatabase is an application-defined helper, not part of the SDK
  const dbResults = await searchDatabase(
    args as { query: string; limit?: number }
  );
  // id and name are required in 3.5 Flash, unlike 3.1 Pro
  const followUp = await chat.sendMessage([{
    functionResponse: {
      id,
      name,
      response: { results: dbResults },
    },
  }]);
  console.log(followUp.response.text());
}
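The example above calls searchDatabase without defining it. For local testing, a stand-in along the following lines may be enough. Everything here is hypothetical: the `Product` shape, the sample catalogue, and the substring matching are placeholders for whatever your real data layer does.

```typescript
interface Product {
  name: string;
  category: string;
}

// Hypothetical in-memory catalogue standing in for a real database.
const PRODUCTS: Product[] = [
  { name: "Log Viewer", category: "developer tools" },
  { name: "API Mocker", category: "developer tools" },
  { name: "Schema Linter", category: "developer tools" },
  { name: "Invoice Builder", category: "finance" },
];

// Synchronous core: case-insensitive substring match on name or category.
function searchProducts(args: { query: string; limit?: number }): Product[] {
  const needle = args.query.toLowerCase();
  const matches = PRODUCTS.filter(
    (p) =>
      p.name.toLowerCase().includes(needle) ||
      p.category.toLowerCase().includes(needle)
  );
  return matches.slice(0, args.limit ?? 5);
}

// Async wrapper with the signature the chat example expects.
async function searchDatabase(args: { query: string; limit?: number }): Promise<Product[]> {
  return searchProducts(args);
}

console.log(searchProducts({ query: "developer tools", limit: 2 }));
```

Keeping the matching logic in a synchronous function makes it easy to unit-test, while the async wrapper can later be swapped for a real database call without touching the chat code.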
When to Use Flash vs Pro vs Ultra
With Flash 3.5 outperforming 3.1 Pro on most benchmarks at lower cost, the model routing decision has simplified significantly:
Use Gemini 3.5 Flash for the large majority of production workloads: chat interfaces, document analysis, coding assistance, agentic pipelines with MCP tool use, batch processing, and any task where throughput or cost matters. The model now handles tasks that previously required Pro. This is the new default for new projects starting today.
Use Gemini 3.1 Pro for Computer Use tasks only. Flash 3.5 does not support Computer Use at launch. If your pipeline includes browser automation, desktop interaction, or screen reading via computer use, 3.1 Pro remains the required model until Google adds support to the Flash series.
Use Gemini 3.5 Pro (expected approximately June 2026) for tasks requiring maximum reasoning depth: complex multi-step research, long legal or financial document synthesis requiring deep inference, or tasks where token budget is not a constraint.
Use Gemini 3.5 Ultra for the highest-stakes reasoning when it ships. Flash handles the throughput tier; Pro and Ultra handle the capability ceiling.
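For the two models that are available today, the routing rule above reduces to a single check. A minimal sketch, assuming a hypothetical `TaskProfile` type and `chooseModel` helper; Gemini 3.5 Pro and Ultra are left out because this article gives no identifiers for them.

```typescript
// Hypothetical description of what a task needs.
interface TaskProfile {
  needsComputerUse: boolean;
}

// Model identifiers as they appear in this article.
type ModelId = "gemini-3.5-flash" | "gemini-3.1-pro";

// Route Computer Use to 3.1 Pro; everything else defaults to 3.5 Flash.
function chooseModel(task: TaskProfile): ModelId {
  return task.needsComputerUse ? "gemini-3.1-pro" : "gemini-3.5-flash";
}

console.log(chooseModel({ needsComputerUse: false }));
console.log(chooseModel({ needsComputerUse: true }));
```

Centralising the choice in one function means a later switch, for example when Flash gains Computer Use support, is a one-line change.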
Agentic Use Cases: Where Flash 3.5 Was Optimized
The benchmark composition signals where Google spent optimization effort. MCP Atlas at 83.6% means Flash 3.5 reliably completes multi-step tool chains — the class of tasks where earlier models frequently stalled on malformed or out-of-order tool calls. GDPval-AA at 1656 Elo means it completes real-world agentic tasks at a rate approaching human-level performance on the evaluation set.
The model is the backend for two of Google’s largest-scale agent deployments: Gemini Spark — the persistent 24/7 agent platform covered in depth in the Gemini Spark and Antigravity 2.0 developer guide — and Google AI Mode, which crossed one billion monthly users as covered in the Google AI Mode analysis from I/O 2026. Both require the model to maintain goal state across extended interactions and complete tool-use chains reliably at scale.
For developers building production agents: migrating an existing agentic pipeline from 3.1 Pro to 3.5 Flash should produce measurable improvement in task completion rates and reduce tool-call error recovery overhead. If your current agent has retry logic specifically to handle malformed function calls from the model, test whether that logic is still exercised after migration — with MCP Atlas at 83.6%, it often will not be.
Context Window: Using 1M Tokens Responsibly
The 1M-token context window and 65k output token limit are available in Flash 3.5 on GA. At $1.50/M input with $0.15/M cached, large context becomes economically viable at scale. A 200-page technical document, a complete application codebase, or six months of customer support transcripts can fit in a single context window and be reasoned over in a single call.
Two things to keep in mind. First, thought preservation across multi-turn conversations adds thinking tokens to the context accumulated over a session. A ten-turn conversation with thinking_level: "high" can accumulate substantially more tokens than the raw user/assistant turn count implies. Monitor token usage on long sessions before they reach production. Second, thinking tokens do not count against the 65k output limit but are billed at the output token rate. Choose thinking_level with that cost in mind.
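To make the first point concrete, the sketch below totals a session's tokens and trims the oldest turns to fit a budget. It is a simplified model, not a description of how the API accounts for tokens: the `Turn` shape, the per-turn token counts, and the `truncateHistory` helper are all assumptions made for illustration.

```typescript
// Hypothetical per-turn accounting for illustration.
interface Turn {
  role: "user" | "model";
  visibleTokens: number;  // tokens in the visible text of the turn
  thinkingTokens: number; // preserved thinking tokens (0 for user turns)
}

function turnCost(turn: Turn): number {
  return turn.visibleTokens + turn.thinkingTokens;
}

function sessionTokens(turns: Turn[]): number {
  return turns.reduce((sum, t) => sum + turnCost(t), 0);
}

// Keep the most recent turns that fit within maxTokens, dropping older ones.
function truncateHistory(turns: Turn[], maxTokens: number): Turn[] {
  const kept: Turn[] = [];
  let total = 0;
  for (const turn of [...turns].reverse()) {
    if (total + turnCost(turn) > maxTokens) break;
    kept.unshift(turn);
    total += turnCost(turn);
  }
  return kept;
}

// Ten alternating turns with made-up sizes: user 200, model 300 visible + 1500 thinking.
const history: Turn[] = Array.from({ length: 10 }, (_, i): Turn =>
  i % 2 === 0
    ? { role: "user", visibleTokens: 200, thinkingTokens: 0 }
    : { role: "model", visibleTokens: 300, thinkingTokens: 1500 }
);

console.log(sessionTokens(history));
console.log(truncateHistory(history, 4000).length);
```

In this made-up session the visible text is 2,500 tokens but the total with preserved thinking is 10,000, which is the kind of gap worth measuring on real traffic before setting a truncation budget.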
Migration Checklist
- Update model identifier from gemini-3-flash-preview or gemini-3.1-pro to gemini-3.5-flash
- Replace thinking_budget: number with thinking_level: "low" | "medium" | "high"
- Remove temperature, top_p, and top_k from all generation configs
- Add id and name fields to every FunctionResponse part
- Move multimodal content inside functionResponse parts if using multimodal function responses
- Audit multi-turn conversation token usage — thought preservation is now on by default
- Remove Computer Use logic from Flash paths; route those calls to Gemini 3.1 Pro separately
- Update the @google/generative-ai SDK to the latest release for the updated type definitions
- Run your existing test suite against the new model identifier before promoting to production
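Several of the checklist items can be spotted with a plain text scan before any code is run. The sketch below is a rough aid rather than a real linter: the `scanSource` helper and its regular expressions are assumptions written for this example, and they will produce false positives (for instance, gemini-3.1-pro is still correct on Computer Use paths).

```typescript
interface Finding {
  rule: string;
  line: number; // 1-based line number in the scanned source
}

// Patterns derived from the checklist above; deliberately simple.
const RULES: { rule: string; pattern: RegExp }[] = [
  { rule: "old model identifier", pattern: /gemini-3-flash-preview|gemini-3\.1-pro/ },
  { rule: "thinking_budget should become thinking_level", pattern: /\bthinking_budget\b/ },
  { rule: "sampling parameter should be removed", pattern: /\b(temperature|top_p|top_k)\s*:/ },
];

function scanSource(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { rule, pattern } of RULES) {
      if (pattern.test(text)) {
        findings.push({ rule, line: index + 1 });
      }
    }
  });
  return findings;
}

const sample = [
  'const model = genAI.getGenerativeModel({ model: "gemini-3-flash-preview" });',
  "const generationConfig = { temperature: 0.7, thinking_budget: 8192 };",
].join("\n");

console.log(scanSource(sample));
```

A scan like this only narrows down where to look; the function-response and multimodal items on the checklist still need a manual review or a test run.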
Getting Started Today
Gemini 3.5 Flash is generally available via the Gemini API as of May 19, 2026. The model string is gemini-3.5-flash. Update your SDK first:
npm install @google/generative-ai@latest
Google AI Studio at ai.google.dev has been updated to use Flash 3.5 as the default model in the playground. If you have been testing prompts in AI Studio since I/O and noticed faster responses, that is the model change. The API pricing page at ai.google.dev/gemini-api/docs/pricing reflects the current Flash 3.5 rates.
For teams starting new agentic projects: the Antigravity 2.0 developer platform, released alongside Flash 3.5 at I/O, provides a full agent harness, multi-agent orchestration, and a managed MCP gateway that runs Flash 3.5 as its default. The Gemini Spark and Antigravity 2.0 guide covers the platform in full if you are evaluating whether to build your orchestration layer from scratch or adopt the Google-native stack.