For seven generations of Google’s Tensor Processing Units, the same chip handled both training large models and running them in production. That approach made sense when models were smaller and the two workloads had broadly similar compute profiles. At Google Cloud Next 2026 on April 22, Google announced that this era is over.
Google unveiled its eighth-generation TPUs as two separate chips: the TPU 8t, purpose-built for training, and the TPU 8i, purpose-built for inference. Each is independently optimized for the radically different computational demands of its workload. The result is up to three times faster model training, 80% better performance per dollar for inference, and the ability to run over one million TPUs in a single logical cluster via the new Virgo Network.
This guide covers what changed, why the architectural split matters, what the actual specifications look like, how these chips compare to NVIDIA’s offerings, and what developers should do right now.
The Problem with One Chip for Two Workloads
Training a large language model and running it in production are fundamentally different computational problems, and that gap has been widening for three years.
Training is throughput-bound. You want to process as many tokens, gradients, and weight updates as possible per second, across as many chips as you can synchronize efficiently. You accept longer job runtimes — days or weeks — in exchange for maximizing throughput. The critical bottlenecks are inter-chip communication bandwidth, memory bandwidth for reading and writing large weight tensors, and scale-out efficiency as you add more chips to the cluster.
Inference is latency-bound. When a user submits a query, the first token must arrive within milliseconds, subsequent tokens must stream at a rate that feels natural, and the system must serve thousands of concurrent users without degrading. The critical bottlenecks are on-chip SRAM (to avoid slow HBM reads on frequently accessed attention patterns), per-request collective communication latency, and a network topology optimized for scatter/gather rather than all-reduce.
A single chip optimized for both workloads means accepting compromises on both ends. Training wants more scale-up bandwidth and interchip interconnect; inference wants more on-chip SRAM and a topology that eliminates latency on individual requests. By splitting the two into dedicated chips, Google can optimize each without compromise — and the specifications of the 8t and 8i make clear just how different those optimizations actually are.
TPU 8t: Built for Training at Scale
Specifications
The TPU 8t is designed for one purpose: training the largest models that exist, at the fastest possible throughput.
- HBM: 216 GB per chip at 6.5 TB/s memory bandwidth
- On-chip SRAM: 128 MB
- Compute: Up to 12.6 petaFLOPS of 4-bit floating-point (FP4) per chip
- Chip-to-chip interconnect: 19.2 Tbps — double the previous generation (Ironwood)
- Superpod scale: Up to 9,600 chips with 2 petabytes of aggregate shared HBM
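As a sanity check, the superpod aggregates follow directly from the per-chip figures above. A quick back-of-envelope sketch, using nothing beyond the listed specs:

```python
# Back-of-envelope aggregates for a full TPU 8t superpod,
# derived from the per-chip figures listed above.

CHIPS_PER_SUPERPOD = 9_600
HBM_PER_CHIP_GB = 216          # GB of HBM per chip
FP4_PFLOPS_PER_CHIP = 12.6     # peak FP4 petaFLOPS per chip

aggregate_hbm_pb = CHIPS_PER_SUPERPOD * HBM_PER_CHIP_GB / 1_000_000      # GB -> PB
aggregate_fp4_eflops = CHIPS_PER_SUPERPOD * FP4_PFLOPS_PER_CHIP / 1_000  # PFLOPS -> EFLOPS

print(f"Aggregate HBM: {aggregate_hbm_pb:.2f} PB")   # ~2.07 PB, matching the "2 PB" claim
print(f"Peak FP4:      {aggregate_fp4_eflops:.1f} EFLOPS")
```

The multiplication confirms the headline figure: 9,600 chips at 216 GB each is about 2.07 petabytes of HBM.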
The chip-to-chip interconnect doubling is the most architecturally significant number. Synchronizing gradients across thousands of chips during backpropagation is one of the hardest scaling bottlenecks in distributed training. The previous generation’s interconnect became the ceiling on how efficiently a large superpod could run all-reduce operations. At 19.2 Tbps, that ceiling has been substantially raised.
Superpod Scale
A single TPU 8t superpod now accommodates up to 9,600 chips with two petabytes of aggregate shared high-bandwidth memory. That is a cluster large enough to train models with hundreds of billions of parameters without inter-node communication becoming the primary bottleneck. Google claims this configuration delivers up to three times faster model training compared with Ironwood, and up to 2x better performance per watt.
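To see concretely why interconnect bandwidth sets the ceiling, consider a naive single-ring all-reduce over one link. The ring algorithm, the 500B-parameter model, and the BF16 gradients below are illustrative assumptions, not Google's actual collective implementation (which shards traffic across many links of the torus); the point is only that doubling link bandwidth halves the collective time:

```python
# Rough per-step time for a ring all-reduce of a gradient tensor.
# Illustrative model only: one logical ring over one link per chip.

def ring_allreduce_seconds(grad_bytes: float, chips: int, link_bytes_per_s: float) -> float:
    """Each chip sends/receives ~2*(N-1)/N of the tensor over its link."""
    traffic = 2 * (chips - 1) / chips * grad_bytes
    return traffic / link_bytes_per_s

GRAD_BYTES = 500e9 * 2        # hypothetical 500B-param model, BF16 gradients
IRONWOOD_LINK = 9.6e12 / 8    # 9.6 Tbps (half of 19.2) -> bytes/s
TPU8T_LINK = 19.2e12 / 8      # 19.2 Tbps -> bytes/s

t_old = ring_allreduce_seconds(GRAD_BYTES, 9_600, IRONWOOD_LINK)
t_new = ring_allreduce_seconds(GRAD_BYTES, 9_600, TPU8T_LINK)
print(f"per-step all-reduce: {t_old:.2f}s -> {t_new:.2f}s")
```

Real deployments overlap and shard this traffic, so absolute numbers will differ; the 2x bandwidth ratio is what carries through.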
To put that in practical terms: a training run that took three weeks on the previous generation should complete in roughly one week on TPU 8t. The cost picture is less settled. Per-chip-hour rates have not yet been announced, and the 80% performance-per-dollar figure Google quotes applies to inference on the 8i, not to training. If 8t pricing lands anywhere near Ironwood's, though, a 3x throughput gain translates directly into a substantially cheaper run.
TPU 8i: Built for Inference at Scale
Specifications
The TPU 8i is designed for one purpose: serving millions of concurrent inference requests with the lowest possible latency.
- On-chip SRAM: 384 MB — three times more than the TPU 8t
- HBM: 288 GB per chip at 8.6 TB/s memory bandwidth
- Compute: 10.1 petaFLOPS of FP4 per chip
- Pod scale: 1,152 chips per pod
- New components: Collectives Acceleration Engine (CAE) + Boardfly network topology
The 384 MB of on-chip SRAM is the headline number. On-chip SRAM is an order of magnitude faster to access than HBM — lower latency, no bus contention, no DRAM timing overhead. For a large model serving billions of daily queries, the fraction of the model that fits in on-chip SRAM determines tail latency. Attention layers and frequently accessed MLP blocks can live in SRAM during serving, avoiding the slow trip to HBM on every request.
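To get a feel for what 384 MB buys, here is a sketch of how much attention weight actually fits. The model dimensions are hypothetical, chosen only to illustrate the order of magnitude:

```python
# How much of a model's hot path fits in 384 MB of on-chip SRAM?
# The d_model below is a hypothetical example, not a real Gemini dimension.

SRAM_BYTES = 384 * 1024**2
FP4_BYTES_PER_PARAM = 0.5

d_model = 8_192
attn_params_per_layer = 4 * d_model**2    # Q, K, V, and output projections
attn_bytes_per_layer = attn_params_per_layer * FP4_BYTES_PER_PARAM

layers_resident = int(SRAM_BYTES // attn_bytes_per_layer)
print(f"FP4 params that fit in SRAM: {SRAM_BYTES / FP4_BYTES_PER_PARAM / 1e6:.0f}M")
print(f"Full attention layers resident at once: {layers_resident}")
```

Roughly 800M FP4 parameters fit, which is a few full attention layers at this hypothetical width, enough to keep the hottest blocks out of HBM entirely.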
The TPU 8i also carries more HBM bandwidth (8.6 TB/s) than the TPU 8t (6.5 TB/s). This reflects inference's memory-bandwidth-bound nature: inference reads the weights once per forward pass and gets little of the data reuse that training's large batches provide, so raw bandwidth frequently becomes the limiting factor.
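A rough ceiling on single-stream decode speed follows directly from that bandwidth figure. The 70B-parameter FP4 model below is a hypothetical example, not a Google benchmark:

```python
# Upper bound on single-stream decode speed when every weight is read
# from HBM once per generated token. Model size is an assumed example.

HBM_BANDWIDTH = 8.6e12        # bytes/s on TPU 8i
model_bytes = 70e9 * 0.5      # hypothetical 70B-param model in FP4 (~35 GB)

max_tokens_per_s = HBM_BANDWIDTH / model_bytes
print(f"decode ceiling: ~{max_tokens_per_s:.0f} tokens/s per chip (batch size 1)")
```

Batching, speculative decoding, and SRAM residency all change the real number, but the bound makes the bandwidth-over-capacity trade-off tangible.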
Boardfly and the Collectives Acceleration Engine
Google introduced two new components for the TPU 8i that have no equivalent in previous generations:
Boardfly is a new network topology specifically designed for serving workloads. Training topologies optimize for all-reduce operations across large superpods, where all chips share gradient updates simultaneously. Inference has a different pattern: each incoming request touches a specific subset of chips, needs a fast response, and then the chips move to the next request. Boardfly minimizes the communication latency for these scatter/gather patterns, reducing the per-request overhead that accumulates at high concurrency.
The Collectives Acceleration Engine (CAE) offloads collective communication operations from the main tensor cores entirely. Without CAE, collective operations (like the all-gather needed during tensor-parallel inference) stall the compute units while they wait for data. CAE runs these collectives on dedicated silicon in parallel with tensor core computation, reducing effective inference latency by eliminating those stalls.
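The latency benefit of offloading collectives can be captured in a toy model: without a dedicated engine, the collective adds to each step's wall time; with one, it hides behind compute. The millisecond figures here are illustrative assumptions, not measured TPU numbers:

```python
# Toy latency model for overlapping collectives with compute, as a
# dedicated engine like the CAE allows. Numbers are illustrative only.

def step_latency_ms(compute_ms: float, collective_ms: float, overlapped: bool) -> float:
    if overlapped:
        return max(compute_ms, collective_ms)  # collective hidden behind compute
    return compute_ms + collective_ms          # collective stalls the tensor cores

serial = step_latency_ms(compute_ms=4.0, collective_ms=1.5, overlapped=False)
hidden = step_latency_ms(compute_ms=4.0, collective_ms=1.5, overlapped=True)
print(f"per-step latency: {serial} ms serial vs {hidden} ms with offloaded collectives")
```

In this sketch the collective vanishes from the critical path entirely; in practice the win is whatever fraction of the collective can be overlapped.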
The Virgo Network: Connecting It All
Both chips run on Google’s new Virgo Network, a megascale data center fabric announced alongside the TPUs at Cloud Next 2026. Virgo is designed to allow over one million TPUs to operate as a single logical cluster — not across the public internet, but within Google’s private data center network with performance characteristics closer to on-chip interconnect than to WAN networking.
From a developer perspective, Virgo is largely invisible. You do not configure it; you benefit from it through Google Cloud’s AI Hypercomputer platform. But it matters because it is the reason the TPU 8t superpod’s 9,600-chip scale is actually usable in practice. When the inter-pod interconnect is fast enough, scaling from 1,000 chips to 9,600 chips delivers near-linear throughput gains. When it is not, scaling efficiency degrades rapidly and the additional chips produce diminishing returns.
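The stakes of scaling efficiency are easy to quantify: effective throughput is just chip count times per-chip rate times efficiency. The efficiency values below are hypothetical, purely to show how quickly a slow fabric erodes the value of extra chips:

```python
# Effective cluster throughput under different scaling efficiencies.
# The 0.95 and 0.60 efficiency figures are hypothetical illustrations.

def effective_throughput(chips: int, per_chip_pflops: float, efficiency: float) -> float:
    return chips * per_chip_pflops * efficiency

fast_fabric = effective_throughput(9_600, 12.6, efficiency=0.95)
slow_fabric = effective_throughput(9_600, 12.6, efficiency=0.60)
print(f"near-linear: {fast_fabric/1e3:.1f} EFLOPS vs degraded: {slow_fabric/1e3:.1f} EFLOPS")
```

At superpod scale, the gap between those two efficiencies is worth thousands of chips' output, which is why the fabric, not the chip, often decides the economics.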
Google claims that Virgo-backed AI Hypercomputer clusters can achieve linear scaling efficiency up to the full one-million-chip scale, which would be an unprecedented result if it holds in production workloads.
What This Means for AI Developers
The TPU 8t/8i split has different implications depending on which part of the AI lifecycle you care about.
If You Train Foundation Models or Run Large Fine-Tunes
The TPU 8t’s 3x training throughput is the most cost-relevant announcement for your workload. Timeline compression at this scale has compounding effects: faster iteration means more experiments per dollar, which means better models per unit of budget. Teams training at the 100B+ parameter scale on Google Cloud should be on the TPU 8t waitlist now.
If You Run Production Inference
The TPU 8i’s 384 MB of on-chip SRAM is the number that matters most. If you serve Gemini-class or similar large models, first-token latency is directly tied to how much of the model’s hot path fits in fast memory. At 3x the on-chip SRAM of the training chip, the 8i can hold substantially more of the model in fast memory than previous generations. Pair that with the Boardfly topology’s lower per-request overhead, and p99 latency at high concurrency improves meaningfully.
If You Use Google Cloud Managed AI Services
Google runs Gemini API, Vertex AI predictions, and its other managed AI services on its own TPU fleet. As the 8th generation rolls out internally, the services you call will become faster and cheaper without any changes on your end. The 16 billion tokens per minute that Google Cloud already processes will scale further as the new generation comes online.
If You Are Currently on NVIDIA GPUs
This announcement is not an immediate reason to switch. The H200 NVL and upcoming Blackwell Ultra are well-matched competitors, and the CUDA ecosystem’s maturity, library coverage, and developer familiarity remain real advantages. PyTorch on TPUs has improved significantly in 2025–2026 but is not identical to native CUDA performance. For teams already heavily invested in CUDA tooling, the switching cost is real.
That said, the TPU 8t/8i performance-per-dollar numbers are competitive in a way previous generations never clearly were. If your team is starting a new large training run or building a new inference-at-scale system and is not yet locked into NVIDIA, it is worth comparing pricing once Google Cloud rates are published.
Google vs. NVIDIA: What the Chip War Looks Like in 2026
The framing of Google’s TPU announcements as “competing with NVIDIA” is accurate but slightly incomplete. Google is competing with NVIDIA for cloud AI workloads that Google Cloud can capture directly. But it is also competing against the scenario where enterprises build private GPU clusters rather than using managed cloud at all.
NVIDIA’s advantage is ecosystem depth. CUDA, cuDNN, NVLink, and the broader software stack have years of tooling, benchmarks, and developer familiarity. The TPU ecosystem requires JAX or TensorFlow for optimal performance, and while PyTorch via XLA has improved substantially, there is still a learning curve for teams moving from pure CUDA workflows.
Google’s advantage at the top end is scale. If you need a 9,600-chip training run with 2 petabytes of shared HBM, operating on a single logical cluster via a high-bandwidth fabric, the number of technically viable options in the world is very small. Google Cloud with AI Hypercomputer is one of them. For the organizations running the largest training workloads on Earth — which increasingly means the organizations building the next generation of frontier models — Google has made a compelling case that TPU 8t is the best option available.
How to Get Access
Both chips are available through Google Cloud’s AI Hypercomputer platform. Here is the current access path:
- Interest registration: Google has opened a sign-up page at cloud.google.com/resources/tpu-interest for both the TPU 8t and 8i
- Priority access: Existing Cloud TPU quota holders will receive priority access as the generation rolls out
- GA timeline: General availability for both chips is expected in the second half of 2026
- Supported frameworks: TensorFlow, JAX, and PyTorch via XLA are all supported with the full managed stack
- Indirect access now: Gemini 3.1 Ultra (which runs on TPU 8i internally) is available via the Gemini API today
The Bottom Line
The TPU 8t and TPU 8i represent a genuine architectural maturation in Google’s chip strategy. The decision to split training and inference into dedicated hardware is not about throwing more transistors at the problem — it is a recognition that the two workloads have diverged to the point where the compromises of a unified chip are no longer acceptable at scale.
The numbers make the case concretely: 3x faster training, 80% better performance per dollar, 384 MB of on-chip SRAM for inference, and a new network topology designed specifically for the communication patterns of serving at million-query-per-second scale. These are not marginal improvements on existing chips — they are the result of several years of workload specialization research finally reflected in silicon.
For developers, the practical signal is simple: training and inference now have different optimal hardware, and the gap will only grow. Teams that size their Google Cloud infrastructure without distinguishing between the two workloads will leave performance and budget on the table. The era of one chip for all AI is ending. Build your infrastructure accordingly.
Written by
Anup Karanjkar
Expert contributor at WOWHOW. Writing about AI, development, automation, and building products that ship.