Frontier AI models have redefined the unit of compute. At trillion-parameter scale, AI training requires thousands of interconnected components, orchestrated in industrial-scale deployments to operate as a single, massive entity.
Likewise, when it comes to reliability, aggregate infrastructure availability is what matters. Yet for almost two decades, instance-level reliability has been the cloud standard. Designed for microservices and horizontally scalable applications, instance-level reliability treats infrastructure as a collection of small independent units. This model is fundamentally inadequate for large-scale AI workloads.
We believe reliability must shift from an instance- to a cluster-level model.
For over a decade, Google has operated Tensor Processing Unit (TPU) clusters at scale, achieving reliability that meets the architectural requirements of modern AI workloads. In this blog, we present our cluster-level reliability framework for Google Cloud TPUs. It focuses on collective performance at the superpod level and is the same framework we use internally at Google to build the world’s most advanced AI models. This framework is the operational standard for TPUs in production today, and it serves as the architectural blueprint for our recently announced eighth-generation TPUs.
Reliability for AI supercomputers
TPU superpods consist of thousands of chips arranged into cubes of 64 TPUs each, with high-speed Inter-Chip Interconnect (ICI) links connecting every chip within a cube and a dynamically configurable Optical Circuit Switch (OCS) network connecting the cubes to form a superpod.
To sustain system-wide training progress, we must maximize the number of fully healthy cubes within a superpod. Because AI model performance relies on high-bandwidth, low-latency communication, every chip and ICI link within a cube must be operational for that cube to contribute to training. Driven by these architectural realities, our cluster-level framework helps define how the industry can achieve reliability in the AI era, moving from instance-level reliability to availability at scale.
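To make the all-or-nothing nature of a cube concrete, here is a minimal sketch, using illustrative per-chip availability numbers (not measured TPU figures) and assuming independent failures, of how individual chip availability compounds at the cube and superpod level:

```python
# Minimal sketch: how per-chip availability compounds into cube- and
# superpod-level availability. The availability values below are
# illustrative assumptions, not measured TPU figures.

CHIPS_PER_CUBE = 64

def cube_availability(chip_availability: float) -> float:
    """A cube contributes to training only if all 64 chips (and their
    ICI links) are healthy, so under independent failures its
    availability is the product of the per-chip availabilities."""
    return chip_availability ** CHIPS_PER_CUBE

def expected_healthy_cubes(chip_availability: float, num_cubes: int) -> float:
    """Expected number of fully healthy cubes in a superpod."""
    return num_cubes * cube_availability(chip_availability)

if __name__ == "__main__":
    for p in (0.999, 0.9999, 0.99999):
        print(f"chip availability {p}: "
              f"cube availability {cube_availability(p):.4f}, "
              f"expected healthy cubes out of 64: "
              f"{expected_healthy_cubes(p, 64):.1f}")
```

Even with 99.9% per-chip availability, a 64-chip cube is fully healthy only about 94% of the time, which is why cluster-level reliability has to be engineered explicitly rather than inherited from instance-level numbers.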
Deep dive: The mathematics of availability at scale
Instance-level reliability models are often deterministic, but industrial-scale AI deployments require a probabilistic approach over thousands of chips. In a traditional setup, you might track the Mean Time Between Failures (MTBF) of a single chip. However, at the scale of frontier AI, the cluster-level MTBF drops sharply as the number of components grows.
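As a rough illustration (our simplification, assuming independent, exponentially distributed failures so that failure rates add across components), the cluster-level MTBF shrinks roughly in proportion to the component count N:

$$\mathrm{MTBF}_{\mathrm{cluster}} \approx \frac{\mathrm{MTBF}_{\mathrm{component}}}{N}$$

Under this assumption, a component with a 10-year MTBF (roughly 87,600 hours), deployed 9,000 times, yields a cluster that experiences a failure somewhere on the order of every 10 hours.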
To visualize how quickly scaling can erode confidence, we can look at simple bounds like Markov’s inequality.
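As a brief sketch of what such a bound looks like (the illustration and numbers here are ours, not drawn from a measured system): for a non-negative random variable X, such as the number of component failures during a training run, Markov’s inequality gives

$$P(X \ge a) \le \frac{\mathbb{E}[X]}{a}.$$

For example, if a run is expected to see two component failures, the probability of seeing twenty or more is at most 2/20 = 10%. Because the expected failure count E[X] grows linearly with the number of components, the guarantee this bound provides for any fixed failure budget a weakens in direct proportion to cluster size, which is exactly how confidence erodes as deployments scale.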