Mistral Medium 3.5 launched April 29, 2026 as the most capable open-weight model the French AI lab has released: a 128-billion-parameter dense model scoring 77.6% on SWE-Bench Verified, with a 256,000-token context window and native multimodal input. The launch package includes two production-ready agentic products: Vibe remote coding agents that run asynchronously in Mistral's cloud, and a new Work Mode in Le Chat that executes parallel multi-step tasks across email, calendar, documents, Jira, and Slack. API pricing starts at $1.50 per million input tokens and $7.50 per million output tokens. The weights ship under a modified MIT license and run self-hosted on four A100 80GB GPUs using vLLM, SGLang, or Ollama. This guide covers the API integration, benchmark context, Vibe remote agents, Le Chat Work Mode, pricing comparison, and self-hosting setup.
What Makes Medium 3.5 Different From Previous Mistral Models
Mistral's earlier model lineup split concerns across specialized models: Mistral Small for fast, cheap tasks, Medium for balanced workloads, and Large for complex reasoning. Medium 3.5 consolidates this into a single 128B dense model that handles instruction-following, reasoning, and coding without routing to sub-models. Unlike mixture-of-experts (MoE) architectures that activate only a fraction of parameters per inference call, Medium 3.5 runs all 128 billion parameters on every request. Dense models trade per-token compute efficiency for behavioral consistency: the same reasoning quality applies to a quick factual lookup as to a long multi-step coding task. That consistency is particularly relevant for agentic workloads, where a single run might mix research, code generation, and document parsing in one context window.
The 256K token context window is large enough to ingest entire codebases for review, long legal or financial documents, or multi-turn agent histories without truncation. Reasoning effort is configurable per request through a budget_tokens parameter: set a low value for quick chat responses, a high value for complex analysis or refactoring tasks. The vision encoder was trained from scratch to handle variable image sizes and aspect ratios natively, so the model processes screenshots at their original dimensions without padding or cropping. The modified MIT license permits commercial self-hosting with no usage reporting requirement to Mistral, which distinguishes this release from the closed-weight frontier models it competes with on benchmark scores.
Benchmark Performance: SWE-Bench 77.6% in Context
Mistral Medium 3.5 scores 77.6% on SWE-Bench Verified, the benchmark that tests whether a model can resolve real GitHub issues by generating patches that pass the issue's test suite. The Verified subset filters the full SWE-Bench dataset to issues with high-quality test suites and confirmed fixes, making it a more reliable comparison point than the full dataset, where ground truth is often ambiguous. Mistral's 77.6% was achieved with a coding agent harness (a scaffolded loop that allows the model to read files, run tests, and iterate), which is the standard methodology for production coding agents in 2026.
The second benchmark in the launch is 91.4% on Tau3-Telecom, an agentic benchmark testing multi-step task completion in realistic enterprise scenarios: scheduling, data lookup, and cross-tool orchestration. This number is operationally more relevant for teams building on Vibe remote agents or Le Chat Work Mode, because it directly tests the same task types those products execute.
For positioning context: Claude Opus 4.7 and GPT-5.5 represent the closed-weight frontier at $5 to $25 per million tokens. Medium 3.5 enters below that price tier while achieving competitive SWE-Bench numbers, which makes the benchmark comparison straightforward for cost-sensitive production decisions. Mistral did not publish standard language benchmark numbers (MMLU, HumanEval, MATH) at launch; SWE-Bench and Tau3 are the primary public performance data available.
API Integration Guide
Mistral Medium 3.5 uses an OpenAI-compatible API. The model ID is mistral-medium-3.5 and the base URL is https://api.mistral.ai/v1. For teams already using the OpenAI Python SDK, switching requires changing two values and nothing else for basic requests:
from openai import OpenAI

client = OpenAI(
    api_key="your-mistral-api-key",
    base_url="https://api.mistral.ai/v1"  # point the OpenAI SDK at Mistral's endpoint
)

response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[
        {"role": "user", "content": "Review this function for edge cases."}
    ]
)
print(response.choices[0].message.content)
Configurable reasoning uses the budget_tokens parameter. For conversational tasks, a value of 512 keeps latency low. For complex coding or analytical tasks, values between 2048 and 8192 give the model enough compute to work through multi-step reasoning before responding:
response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "Refactor this module to be fully testable."}],
    extra_body={"budget_tokens": 4096}  # reasoning budget, passed through to Mistral's API
)
Vision input follows the standard multimodal message format. The model accepts both base64-encoded images and URL references. Because the vision encoder handles variable aspect ratios natively, you do not need to resize screenshots before sending them:
response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
            {"type": "text", "text": "What bug does this UI screenshot show?"}
        ]
    }]
)
Tool calling uses the standard function calling schema. Define tools as JSON schemas in the tools array, and the model returns structured tool_calls when it decides to invoke a function. The finish_reason field in the response tells your agentic loop whether to execute tools and continue, or return the final answer. The request shape is close enough to the OpenAI standard that existing agent scaffolding built for GPT-5.5 works with Mistral Medium 3.5 after a one-line base URL change.
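As a concrete illustration, here is a minimal sketch of that loop, reusing the client from the first example. The get_open_tickets tool and its schema are hypothetical stand-ins for whatever functions your agent exposes; only the surrounding request and response shapes are the standard ones described above.

import json

# Hypothetical tool definition -- the name, description, and parameters
# are illustrative, not part of Mistral's API.
tools = [{
    "type": "function",
    "function": {
        "name": "get_open_tickets",
        "description": "List open Jira tickets for a project.",
        "parameters": {
            "type": "object",
            "properties": {"project": {"type": "string"}},
            "required": ["project"]
        }
    }
}]

response = client.chat.completions.create(
    model="mistral-medium-3.5",
    messages=[{"role": "user", "content": "How many open tickets does PLATFORM have?"}],
    tools=tools
)

choice = response.choices[0]
if choice.finish_reason == "tool_calls":
    # The model wants a function executed before it can answer.
    call = choice.message.tool_calls[0]
    args = json.loads(call.function.arguments)
    # Execute get_open_tickets(**args), append the result as a
    # {"role": "tool"} message, and call the API again to continue the loop.
else:
    print(choice.message.content)  # final answer, no tool needed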
Pricing: Where Medium 3.5 Fits in the 2026 Market
API pricing is $1.50 per million input tokens and $7.50 per million output tokens. This is a significant increase from the previous Mistral Medium 3 pricing (some analyses put the increase at roughly 4x per token), but the benchmark improvement positions the model against frontier-tier competitors rather than mid-tier alternatives.
- Claude Opus 4.7: $5.00/M input, $25.00/M output
- Mistral Medium 3.5: $1.50/M input, $7.50/M output
- Mistral Small 4: substantially cheaper, appropriate for tasks that do not require 77.6% SWE-Bench capability
The pricing criticism at launch centers on the jump from Medium 3. For teams routing medium-complexity tasks to the previous model, the cost increase without changing routing logic is real. The counterargument is that Medium 3.5 handles tasks that previously required a Large model call, and the consolidated workload is cheaper overall when routing costs are accounted for. Self-hosted inference eliminates per-token costs entirely for teams with appropriate GPU infrastructure, which is a significant factor in total cost of ownership for high-volume production workloads.
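A quick back-of-envelope comparison makes the trade-off concrete. The per-million-token rates below are the published ones; the monthly volume is a hypothetical workload chosen purely for illustration:

# Hypothetical monthly workload: 200M input tokens, 40M output tokens.
INPUT_TOKENS = 200e6
OUTPUT_TOKENS = 40e6

def monthly_cost(input_rate, output_rate):
    # Rates are USD per million tokens.
    return (INPUT_TOKENS / 1e6) * input_rate + (OUTPUT_TOKENS / 1e6) * output_rate

print(monthly_cost(1.50, 7.50))   # Mistral Medium 3.5 -> 600.0 USD
print(monthly_cost(5.00, 25.00))  # Claude Opus 4.7    -> 2000.0 USD

At this volume the Medium 3.5 bill lands at roughly one-third of the Opus rate, which is the ratio the launch positioning implies.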
Vibe Remote Coding Agents
Vibe is Mistral's coding agent IDE, and the April 29 update introduced remote execution mode: coding sessions run on Mistral's cloud infrastructure rather than the developer's local machine. This changes the economics of long-running coding tasks substantially. In the previous local-only mode, a 60-minute refactoring task occupied a terminal session for the full duration. With remote agents, the task runs in the cloud and the developer's CLI is immediately free.
Starting a remote agent session from the CLI:
mistral vibe --remote "Refactor the auth module to support OAuth 2.0 PKCE flow, add unit tests, and open a draft PR"
If you have an existing local Vibe session mid-task, the teleport feature moves session state (context, task progress, and file changes accumulated so far) to a cloud instance. The local CLI disconnects and you receive a webhook or email notification when the session completes. Multiple remote sessions run in parallel: a team can run separate refactoring, test-writing, and documentation tasks simultaneously across different cloud instances without resource contention.
When a session finishes, Vibe can automatically open a draft pull request with a summary of what the agent changed, which tests passed, and which edge cases it flagged but did not resolve. This is a similar pattern to the Claude Code subagent coordination approach, applied natively to the Mistral ecosystem. Remote compute pricing was not announced at launch; Mistral has indicated billing will be per compute-minute rather than per output token, which makes the cost of long-running tasks more predictable.
Le Chat Work Mode: Agentic Task Execution in the Browser
Le Chat's Work Mode is a production release of multi-step agentic task execution inside Mistral's chat interface. Standard Le Chat is a conversational interface requiring back-and-forth input. Work Mode accepts a task description and executes it end-to-end without needing intermediate prompts.
The execution model uses parallel tool calling: multiple tools run simultaneously rather than sequentially. In a single Work Mode session, the agent can read an email thread, add a calendar event, draft a reply, create a Jira ticket from the action items, and post a summary to Slack, with total latency closer to the longest single operation than to the sum of all of them.
Integrations at launch include Google Workspace (Gmail, Calendar, Drive), Jira, Slack, GitHub, and Notion. Each integration uses OAuth and does not expose credentials to the model weights β the model issues tool calls and Mistral's infrastructure handles the actual API requests to connected services. For developers, Work Mode is also available programmatically through the Mistral Agents API, letting you embed the same parallel tool execution into custom products. See our guide on agentic AI governance for architectural context on deploying multi-step agents in production.
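Work Mode's parallelism runs inside Mistral's infrastructure, but the pattern is easy to approximate against the plain chat completions API: when a tools-enabled response comes back with several tool_calls, execute them concurrently instead of one at a time. A minimal sketch, assuming a response object from a tools-enabled request like the one earlier, and send_slack_message / create_jira_ticket as hypothetical wrappers around your own integrations:

import json
from concurrent.futures import ThreadPoolExecutor

def run_tool(call):
    # Dispatch a single tool call to your own integration code.
    handlers = {
        "send_slack_message": send_slack_message,  # hypothetical helper
        "create_jira_ticket": create_jira_ticket,  # hypothetical helper
    }
    args = json.loads(call.function.arguments)
    return handlers[call.function.name](**args)

choice = response.choices[0]
if choice.finish_reason == "tool_calls":
    # Run every requested tool at once: wall-clock time is roughly the
    # slowest single call rather than the sum of all of them.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_tool, choice.message.tool_calls))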
Self-Hosting Mistral Medium 3.5
The model weights are available on HuggingFace at mistralai/Mistral-Medium-3.5-128B under a modified MIT license. Hardware requirements: a minimum of four A100 80GB GPUs for basic inference, with eight recommended for production workloads that need high concurrent throughput.
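The four-GPU floor follows from straightforward weight-memory arithmetic. This is a rough estimate that ignores KV cache, activations, and framework overhead:

PARAMS = 128e9          # dense parameter count
BYTES_PER_PARAM = 2     # FP16/BF16 weights

weight_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(weight_gb)                # 256.0 GB for the weights alone

total_vram = 4 * 80             # four A100 80GB cards
print(total_vram - weight_gb)   # ~64 GB left for KV cache and activations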
Starting a vLLM inference server with tool calling and reasoning support enabled:
vllm serve mistralai/Mistral-Medium-3.5-128B --tensor-parallel-size 8 --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral
The --reasoning-parser mistral flag instructs vLLM to separate the model's internal reasoning trace from the final output in the response body β required if you want to log or display reasoning steps in your application. The --tool-call-parser mistral flag enables structured function calling output compatible with Mistral's tool calling format.
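Once the server is running, the same OpenAI-style client from the API section works against it; vLLM exposes an OpenAI-compatible endpoint on port 8000 by default. Treat the reasoning_content field name as an assumption to verify against your vLLM version:

from openai import OpenAI

local = OpenAI(
    api_key="unused",  # vLLM does not check the key unless you configure one
    base_url="http://localhost:8000/v1"
)

resp = local.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[{"role": "user", "content": "Explain what this stack trace means."}]
)

msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # separated reasoning trace, if present
print(msg.content)                              # final answer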
SGLang is an alternative inference backend with better batch efficiency at high request concurrency. For low-resource setups, Ollama supports quantized versions of Medium 3.5 that run on fewer GPUs at reduced quality; the quality reduction is more noticeable in multi-step coding tasks than in simple question-answering. For enterprise infrastructure, NVIDIA NIM containers are available with pre-configured tensor parallelism and quantization profiles for H100 clusters, which simplifies production deployment without manual vLLM parameter tuning.
When to Choose Mistral Medium 3.5
The model fits best in three scenarios. First, workloads that require coding capability at the frontier tier but are subject to data sovereignty or compliance constraints that prohibit closed-weight APIs. The modified MIT license permits commercial self-hosting without usage reporting. Second, production workflows that genuinely mix text, code, and image inputs in a single context; the native vision encoder and dense architecture handle this without model-switching overhead. Third, teams already invested in the Mistral ecosystem who want to use Work Mode and Vibe remote agents without adding new infrastructure.
Where Medium 3.5 is not the first choice: pure-text reasoning tasks with no coding or vision component, where Mistral Small 4 handles the workload at 10 to 20% of the per-token cost. The 4x price increase from Medium 3 is also a factor for teams with high-volume inference needs whose workloads do not require the full 77.6% SWE-Bench capability profile.
Quick decision rule: If your task involves code + images + long context and you need open-weight deployment, Medium 3.5 is the strongest option currently available. If it's pure text at high volume, route to Mistral Small 4 first and escalate only when Small 4 quality is insufficient.
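In code, that rule is a few lines. A minimal sketch, assuming mistral-small-4 as the smaller model's ID and a good_enough quality check you define yourself (both hypothetical):

def answer(prompt: str) -> str:
    # Try the cheap model first.
    small = client.chat.completions.create(
        model="mistral-small-4",  # assumed model ID, verify against the docs
        messages=[{"role": "user", "content": prompt}]
    )
    text = small.choices[0].message.content
    if good_enough(text):  # your own quality gate -- hypothetical
        return text
    # Escalate only when the cheap model falls short.
    big = client.chat.completions.create(
        model="mistral-medium-3.5",
        messages=[{"role": "user", "content": prompt}]
    )
    return big.choices[0].message.content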
Summary
Mistral Medium 3.5 is a 128B dense open-weight model that scores 77.6% on SWE-Bench Verified, ships under a modified MIT license, and is priced at $1.50 per million input tokens. The Vibe remote agent system makes long-running async coding tasks practical without tying up local resources. Le Chat Work Mode brings parallel multi-step agentic execution to the browser with OAuth-connected integrations. Self-hosting on four A100 GPUs eliminates per-token costs for high-volume workloads. For developers evaluating open-weight coding-capable models in May 2026, this release establishes a new cost-performance threshold: near-frontier benchmark numbers at roughly one-third the API cost of the top closed-weight alternatives.