Tencent just open-sourced a 295-billion-parameter model that went from cold start to production in under three months — and it posted a 74.4% score on SWE-bench Verified, the highest score any Tencent model has achieved. Hy3 Preview, released on April 23, 2026, is the first output of a complete rebuild of Tencent’s Hunyuan pretraining and reinforcement learning infrastructure. The team is led by Yao Shunyu, a former OpenAI researcher who joined Tencent’s AI division in early 2026.
The key numbers: 295B total parameters with only 21B active (Mixture-of-Experts architecture), a 256,000-token context window, and a hybrid fast-slow thinking design. On SWE-bench Verified, it scores 74.4% — a 40% relative improvement over Hy2’s 53%. It drives agentic workflows of up to 495 steps. And it was already deployed inside WeChat, QQ, and Yuanbao before the public announcement landed.
This guide covers the architecture, every major benchmark, how to run it locally with vLLM or SGLang, how to call it via the Tencent Cloud API, and what it means for developers choosing between open models and frontier API subscriptions in mid-2026.
The Rebuild Story
Hy3 Preview is not a fine-tune of a previous Hunyuan checkpoint. It is the first product of a ground-up reconstruction of Tencent’s AI stack, started in late January 2026 and completed in roughly 84 days. That timeline — under three months from cold start to open-source public release — is unusual for a model at this scale.
The compression was possible partly because Tencent deployed Hy3 inside production products before announcing it publicly. Yuanbao (Tencent’s AI assistant), CodeBuddy (their developer copilot), WorkBuddy, and Tencent Docs all ran Hy3 in live production traffic before the weights were posted to Hugging Face. Real user traffic at Tencent’s scale surfaces failure modes that synthetic benchmarks and internal red-teaming miss. By the time Hy3 hit Hugging Face, it had already been debugged at WeChat-scale concurrency.
Yao Shunyu’s background at OpenAI focused on agentic evaluation and interactivity. That background shaped Hy3’s design priorities directly: rather than optimizing purely for single-question benchmark scores, the team treated real-world multi-step agent task completion as a first-class target. That decision shows up in the results.
Architecture: What Makes 295B Act Like 21B
Mixture-of-Experts
Hy3 Preview uses a Mixture-of-Experts (MoE) architecture. Each forward pass activates only 21B of the 295B total parameters — roughly 7%. The remaining parameters sit in specialized expert layers, and a learned router selects which experts process each token. From a deployment perspective, the per-token compute cost resembles a 21B dense model, not a 295B one — though all 295B parameters must still be resident in memory, which is why quantization matters so much for self-hosting.
In practice: a 21B dense model in BF16 requires approximately 42 GB of VRAM, but Hy3’s full weight set — all 295B parameters — needs roughly 590 GB in BF16. With 4-bit quantization (AWQ or GPTQ), that footprint drops to roughly 148 GB, which fits on two H100 80GB GPUs with room remaining for KV cache. Running a 295B-class model on two datacenter GPUs was not realistic six months ago.
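The routing step is what makes the active/total split possible. The sketch below is a generic top-k softmax gate — Tencent has not published the details of Hy3’s actual router, so treat this as an illustration of the technique, not the implementation:

```python
import math
import random

def topk_route(gate_logits, k=2):
    """Generic top-k MoE gating: keep the k highest-scoring experts
    for a token and softmax-normalize their gate weights, so only
    those experts' parameters participate in the forward pass."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: -gate_logits[i])[:k]
    exps = [math.exp(gate_logits[i]) for i in ranked]
    total = sum(exps)
    return [(expert, w / total) for expert, w in zip(ranked, exps)]

# Route one token across 8 hypothetical experts:
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(8)]
print(topk_route(logits))  # two (expert_index, weight) pairs, weights sum to 1
```

With hundreds of experts per layer and k fixed at a small number, the ratio of active to total parameters stays low regardless of how large the total grows — which is exactly the 21B-of-295B arithmetic above.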
Hybrid Fast-Slow Thinking
Hy3 Preview supports two inference modes, selected at runtime via a system prompt parameter:
- Non-Thinking mode: Fast, direct responses. Behaves like a standard instruction-following model. Appropriate for retrieval, summarization, classification, and tasks where speed matters more than reasoning depth.
- Thinking mode: Extended chain-of-thought reasoning, similar to o1-style models. The model generates an internal reasoning trace before producing its final answer. Recommended for math, coding, and complex agentic tasks where correctness outweighs latency.
Both modes share identical weights. The thinking budget can be capped numerically, giving you cost control without disabling the mode entirely. A higher budget allows the model to spend more tokens on internal reasoning before committing to an answer — up to 32,768 tokens for the most demanding tasks.
Context Window
Hy3 Preview supports up to 256,000 tokens. This is not the 1M-token context that DeepSeek V4-Pro ships with, but 256K covers the vast majority of real-world agentic use cases: a full medium-sized codebase, a 200-page PDF, or dozens of rounds of a multi-tool agent conversation all fit comfortably. According to Tencent’s internal RULER benchmark testing, the model maintains consistent performance without degradation across the full 256K window — a bar that many models with larger advertised context windows fail to meet cleanly in practice.
Benchmarks: Where Hy3 Stands
SWE-bench Verified
SWE-bench Verified tests whether a model can autonomously fix real GitHub issues on real codebases, verified by running the original test suite. No partial credit, no handcrafted prompts. It is the hardest commonly-used coding benchmark and the one most predictive of whether a model will help a developer in a real workflow.
- Hy3 Preview: 74.4%
- Hy2 (previous generation): 53.0%
- Improvement: +21.4 percentage points, a 40% relative gain
For context: in early 2025, a 74% SWE-bench score was frontier territory. As of April 2026, it sits behind DeepSeek V4-Pro’s 80.6% and Claude Opus 4.6’s 80.8%, but it is a legitimate result from a model you can self-host on accessible hardware.
Terminal-Bench 2.0
Terminal-Bench 2.0 tests agentic command-line task completion in a real shell environment: navigating file systems, running tests, reading logs, and interpreting command output. Hy3 scores 54.4%, reflecting the team’s deliberate investment in agentic tooling and execution reliability beyond pure code generation.
Reasoning and Mathematics
On the Tsinghua University math PhD qualifying exam (Spring 2026 edition), Hy3 Preview scored 88.4, the top result among Chinese models on that benchmark. The result signals that thinking mode works as intended: PhD-level mathematics is precisely the kind of task where extended reasoning tokens produce measurable accuracy gains over fast, direct responses.
Agentic Search
On WideSearch (multi-step web research) and BrowseComp (browser navigation and synthesis), Hy3 scores 70.2% and 81.3% respectively. These benchmarks test agent stability across multi-step information retrieval, where context management failures and tool-invocation errors accumulate rapidly in weaker models.
Agent Workflow Stability at 495 Steps
Tencent’s internal testing reports that Hy3 Preview has “stably supported complex agent workflows of up to 495 steps in real user environments, spanning document handling, data analysis, knowledge retrieval, and tool orchestration.” In most agent frameworks, context management failures appear well before step 100. Reaching 495 steps in live production traffic — not a benchmark harness — is a meaningful operational result that most open model releases cannot yet claim.
Production Deployment Before the Public Launch
Before the open-source release, Tencent had already deployed Hy3 across more than ten core products:
- Yuanbao — Tencent’s general AI assistant, comparable in scope to ChatGPT
- CodeBuddy — Developer copilot, similar to GitHub Copilot
- WorkBuddy — Enterprise productivity assistant integrated with Tencent’s office suite
- Tencent Docs — Document editing platform with hundreds of millions of users in China
- WeChat and QQ — AI assistant integrations reaching over a billion active users combined
- Peacekeeper Elite — Tencent’s flagship mobile game, testing AI-driven NPC dialogue
This production-first release strategy is uncommon in the open-source model space. Most open-weight releases drop weights after internal research testing. Running a model inside a billion-user consumer platform before public release is a qualitatively different stress test — and it means the Hy3 Preview weights on Hugging Face have already absorbed production debugging that most open releases lack entirely.
How to Run Hy3 Preview Locally
Hardware Requirements
Full BF16 weights require approximately 590 GB of VRAM, which is impractical for most teams. The practical deployment configurations are:
- 4-bit AWQ quantization: ~148 GB VRAM — two H100 80GB GPUs
- 8-bit quantization: ~295 GB VRAM — four A100 80GB or four H100 80GB GPUs
- FP8 via SGLang: ~148 GB VRAM — two H100 80GB GPUs with better throughput than AWQ
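These figures follow from simple bytes-per-parameter arithmetic. The helper below reproduces them, ignoring KV cache and runtime overhead:

```python
def weight_vram_gb(total_params_b, bits_per_param):
    """Weight memory in GB: parameter count (in billions) times bytes
    per parameter. For an MoE model the TOTAL parameter count applies:
    every expert must be resident, even though only ~21B are active."""
    return total_params_b * bits_per_param / 8

print(weight_vram_gb(295, 16))  # BF16  -> 590.0 GB
print(weight_vram_gb(295, 8))   # 8-bit -> 295.0 GB
print(weight_vram_gb(295, 4))   # 4-bit -> 147.5 GB
```

Budget extra headroom on top of these numbers for KV cache, which grows with both context length and concurrent request count.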
vLLM Deployment
```shell
pip install "vllm>=0.8.0"
python -m vllm.entrypoints.openai.api_server \
  --model tencent/Hy3-preview \
  --tensor-parallel-size 2 \
  --quantization awq \
  --max-model-len 65536
```
This launches an OpenAI-compatible server on port 8000. For thinking mode, set --max-model-len to at least 131072 to give the model room for internal chain-of-thought tokens. Enabling thinking mode substantially increases time-to-first-token, so use it only where accuracy takes priority over latency.
SGLang Deployment
```shell
pip install "sglang>=0.4.0"
python -m sglang.launch_server \
  --model-path tencent/Hy3-preview \
  --tp 2 \
  --quantization fp8
```
SGLang achieves lower latency than vLLM for multi-step agent workflows because of its radix attention caching, which reuses KV cache across shared prefixes in long conversations. For tasks running 100+ steps, SGLang is the better deployment choice.
Activating Thinking Mode
Thinking mode is enabled by adding a structured header to your system prompt:
```json
{
  "role": "system",
  "content": "<Think>enabled, budget: 8192</Think> You are a helpful assistant."
}
```
Increase the budget value for harder tasks (up to 32768 for mathematical proofs and complex multi-step planning). The model consumes thinking tokens up to the specified budget before producing its final response. For coding tasks, a budget of 8192 is generally sufficient; for hard math, 16384 or higher is recommended.
API Access via Tencent Cloud
For teams not running their own inference infrastructure, Tencent Cloud’s TokenHub provides managed API access at launch pricing of RMB 1.2 per million input tokens (approximately $0.17/M USD) and RMB 4 per million output tokens (approximately $0.55/M USD).
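As a rough per-request cost sketch at these launch prices — note the RMB-to-USD rate of 7.2 is an assumption chosen to roughly match the dollar figures above:

```python
RMB_PER_USD = 7.2  # assumed conversion rate, approximates ~$0.17/$0.55 per million

def request_cost_usd(input_tokens, output_tokens,
                     in_rmb_per_m=1.2, out_rmb_per_m=4.0):
    """Estimate one request's cost at TokenHub launch pricing.
    Thinking-mode tokens bill as output, so they go in output_tokens."""
    rmb = (input_tokens / 1e6 * in_rmb_per_m
           + output_tokens / 1e6 * out_rmb_per_m)
    return rmb / RMB_PER_USD

# A thinking-mode call that burns its full 8192-token budget plus a 1k answer:
print(round(request_cost_usd(4000, 8192 + 1000), 4))  # -> 0.0058
```

Even a fully spent thinking budget costs well under a cent per call at these rates, which is what makes thinking mode viable as a default for coding workloads.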
The API is OpenAI ChatCompletions-compatible. Any SDK or framework that targets the OpenAI API format — LangChain, LlamaIndex, CrewAI, AutoGen, or the standard openai Python package — works without code changes beyond a base URL swap. At these prices, Hy3 via TokenHub runs at roughly half the cost of Claude Sonnet 4.6 for equivalent output quality on coding-heavy workloads.
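Because the endpoint is ChatCompletions-compatible, the request body can be assembled with nothing but the standard library. The sketch below uses the `<Think>` header syntax from the thinking-mode section; the model name and header format are taken from this article, so verify them against the TokenHub docs before relying on them:

```python
import json

def build_hy3_request(prompt, thinking_budget=None,
                      model="tencent/Hy3-preview"):
    """Assemble an OpenAI ChatCompletions-style request body for Hy3.
    Passing a thinking_budget prepends the <Think> header; omit it
    for fast non-thinking responses."""
    system = "You are a helpful assistant."
    if thinking_budget is not None:
        system = f"<Think>enabled, budget: {thinking_budget}</Think> " + system
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    }

body = build_hy3_request("Fix the failing test in utils.py", thinking_budget=8192)
print(json.dumps(body, indent=2))
```

POST this body to `/v1/chat/completions` under your server’s base URL — via the openai package, plain urllib, or any framework above. Only the base URL differs between TokenHub and a local vLLM deployment.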
Hy3 vs. DeepSeek V4: The Open-Source Frontier in April 2026
Hy3 Preview and DeepSeek V4 launched within 24 hours of each other, making a direct comparison inevitable. They occupy different positions in the capability-cost tradeoff:
DeepSeek V4-Pro (1.6T parameters, 49B active) has stronger raw benchmarks: 80.6% SWE-bench Verified versus Hy3’s 74.4%, and a 1M-token context window versus Hy3’s 256K. For tasks demanding maximum accuracy or extremely long context, V4-Pro is the stronger choice.
Hy3 Preview has a lower hardware requirement for self-hosting, a production deployment track record inside a billion-user platform that V4-Pro as a fresh research release does not yet have, and meaningfully lower API pricing at launch. For teams building production agent systems where self-hostability, cost, and operational reliability matter alongside raw benchmark position, Hy3 is a serious contender.
The broader pattern matters more than any single comparison: three major Chinese labs now have legitimate frontier-class open models. Each release raises the quality floor for what an open-source deployment can achieve and adds competitive pressure on closed API providers to justify their pricing premium.
The Bottom Line
Hy3 Preview matters for three reasons beyond its individual benchmark scores.
First, the velocity: 84 days from cold start to production deployment at WeChat scale to open-source release. That is a new baseline for how fast a well-resourced team can move through the full AI development lifecycle, and it signals that the gap between research and production is compressing across the board.
Second, the accessibility: a model scoring 74.4% on SWE-bench Verified that fits on two H100s was not available six months ago. The hardware barrier to running frontier-quality open models keeps falling in a way that changes the economics of building AI products.
Third, the production track record: Hy3 was not released untested. It ran in Yuanbao, WeChat, and Peacekeeper Elite at scale before the weights hit Hugging Face. That is a different kind of reliability guarantee than most open model releases can offer.
If you are building agent systems, coding tools, or long-context document workflows and currently paying frontier API prices, Hy3 Preview is worth evaluating this week. The weights are at tencent/Hy3-preview on Hugging Face. The TokenHub API is live. At 74.4% SWE-bench on two H100s at $0.55/M output tokens, the cost-quality tradeoff has become genuinely interesting.
Written by
Anup Karanjkar
Expert contributor at WOWHOW. Writing about AI, development, automation, and building products that ship.