Fueling agentic logic and reinforcement learning with Axion, Intel, and AMD
While GPUs and TPUs are great for training and serving AI models, they need to be complemented by high-performance CPU-based services to handle the complex logic, tool-calls, and feedback loops that surround the core AI model. Our new Axion-powered N4A CPU instances deliver outstanding price-performance for these agent runtimes. In fact, GKE Agent Sandbox with Google Axion N4A offers up to 30% better price-performance than agent workloads on other hyperscalers. This efficiency extends across our entire portfolio, including our fourth-generation Compute Engine VM families powered by the latest x86 processors from Intel and AMD. These are optimized for the broadest range of RL tasks, such as RL reward calculation, agent orchestration, and nested virtualization, providing the right capabilities for every AI workload.
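To make this concrete, here is a minimal, hypothetical sketch of the kind of CPU-bound agent runtime these instances are built to host. The orchestration loop, tool registry, and model call below are illustrative stand-ins, not a Google Cloud or GKE API.

```python
# A CPU-side agent runtime: model calls, tool-calls, and a feedback loop.
# All names (call_model, TOOLS, run_agent) are illustrative stand-ins,
# not a Google Cloud or GKE API.
from typing import Callable

# Hypothetical tool registry the agent can invoke between model turns.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: f"top results for {query!r}",
    "lookup_doc": lambda doc_id: f"contents of document {doc_id}",
}

def call_model(messages: list[dict]) -> dict:
    """Stand-in for a request to a model served on GPUs or TPUs."""
    # A real implementation would POST to an inference endpoint and
    # parse the model's chosen action out of the response.
    return {"type": "final", "content": "done"}

def run_agent(task: str, max_steps: int = 8) -> list[dict]:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_model(messages)  # the model picks the next step
        if action["type"] == "final":  # the model produced an answer
            messages.append({"role": "assistant", "content": action["content"]})
            break
        # CPU-side tool call, with the result fed back to the model.
        result = TOOLS[action["tool"]](action["input"])
        messages.append({"role": "tool", "content": result})
    return messages
```

Everything around the model call (parsing actions, dispatching tools, accumulating context) is branchy, latency-sensitive CPU work, which is exactly what these instances target.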
Virgo Network for data center scale-out fabric
As part of AI Hypercomputer, the Virgo Network is designed to meet the demanding requirements of modern large-scale AI workloads. Its collapsed fabric architecture, with 4x the bandwidth of previous generations, eliminates the “scaling tax” and delivers staggering peak computing power, helping the most ambitious AI workloads scale with near-linear efficiency.
With Virgo Network and TPU 8t, we can connect 134,000 TPUs into a single fabric in a single data center, and connect more than one million TPUs across multiple data center sites into a training cluster — essentially transforming globally distributed infrastructure into one seamless supercomputer.
We are also making Virgo Network available for A5X (powered by NVIDIA Vera Rubin NVL72), supporting up to 80,000 GPUs in a single data center and up to 960,000 GPUs across multiple sites.
Storage: Minimizing data bottlenecks
A massive compute cluster is only as effective as the storage system feeding it data. As we make compute faster, storage must keep pace, so we are delivering four key storage advancements that let you:
- Accelerate training and inference: Google Cloud Managed Lustre now delivers 10 TB/s of bandwidth, a 10x improvement over last year and up to 20x faster than other hyperscalers. We’ve also increased its capacity to 80 petabytes. These advancements are powered by our new C4NX instances and Hyperdisk Exapools.
- Minimize latency: Managed Lustre can now use TPUDirect and RDMA to let data bypass the host and move directly to the accelerators. By removing this processing overhead, your AI agents can respond with the near-instant speed users expect.
- Maintain peak utilization for training: Rapid Buckets on Google Cloud Storage transforms object storage with sub-millisecond latency and 20 million operations per second. This helps ensure large-scale training checkpoints and recoveries happen near-instantly, allowing your accelerators to maintain 95% utilization or higher, shortening training cycles while making cost-effective use of valuable TPUs and GPUs (a minimal checkpointing sketch follows this list).
- Build custom solutions: For ISVs and organizations that want to build their own storage solutions, we are launching the Z4M instance, specifically engineered for customers who want to integrate trusted parallel file systems such as VAST Data or Sycomp. Each Z4M instance scales to 168 TiB of local SSD capacity and can be deployed in RDMA clusters spanning thousands of machines.
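As a concrete illustration of the checkpointing pattern above, here is a minimal sketch that saves and restores checkpoints through the standard google-cloud-storage Python client. The bucket and object names are hypothetical, and any Rapid Buckets-specific provisioning is assumed to be handled separately.

```python
# Save and restore training checkpoints via the standard GCS client.
# The bucket and object names are hypothetical; Rapid Buckets
# provisioning is assumed to be configured out of band.
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.bucket("my-training-checkpoints")  # hypothetical bucket

def save_checkpoint(local_path: str, step: int) -> None:
    # One object per step, zero-padded so names sort chronologically.
    blob = bucket.blob(f"run-001/step-{step:08d}.ckpt")
    blob.upload_from_filename(local_path)

def restore_latest(local_path: str) -> int:
    """Download the most recent checkpoint; return its step (0 if none)."""
    blobs = sorted(client.list_blobs(bucket, prefix="run-001/"),
                   key=lambda b: b.name)
    if not blobs:
        return 0
    blobs[-1].download_to_filename(local_path)
    return int(blobs[-1].name.rsplit("step-", 1)[1].split(".")[0])
```

The faster each save and restore completes, the less time accelerators sit idle, which is where the sub-millisecond latency figures above pay off.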
Together, these options form a comprehensive storage portfolio, pairing the raw power of the AI Hypercomputer stack with the right storage service for each use case.
GKE: Orchestration for agent-native workloads
In the agentic era, intelligence is only as effective as the speed at which it can be scaled. So, we’ve transformed GKE to serve as the premier orchestration engine for agent-native workloads.
Reducing latency across the stack
To keep agents responsive, we optimize every millisecond of the start-up and scale-out process. By streamlining how infrastructure responds to surges in demand, GKE ensures that your agents are ready the moment a user engages with the system. New in GKE are:
- Accelerated node and pod startup: GKE nodes now start up to 4x faster, while pod startup times have been slashed by up to 80%.
- Rapid model loading: With the Run:ai Model Streamer and Rapid Cache in Google Cloud Storage, models now load 5x faster, removing a traditional storage bottleneck (see the sketch below).
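To show the model-loading pattern, here is a short sketch based on the open-source Run:ai Model Streamer’s Python interface, which streams safetensors weights concurrently while they are read. The file path is a placeholder, and the exact calls should be checked against the project’s documentation.

```python
# Stream model weights from storage concurrently rather than reading
# them serially; based on the open-source Run:ai Model Streamer's
# Python interface. The file path is a placeholder.
from runai_model_streamer import SafetensorsStreamer  # pip install runai-model-streamer

file_path = "/mnt/lustre/llama/model.safetensors"  # placeholder path

with SafetensorsStreamer() as streamer:
    streamer.stream_file(file_path)  # kick off concurrent reads
    for name, tensor in streamer.get_tensors():
        # Tensors arrive as their chunks finish streaming; hand each
        # one to the model as soon as it is ready.
        print(name, tuple(tensor.shape))
```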
Intelligent routing with AI-powered Inference Gateway
Building on last year’s introduction of GKE Inference Gateway, we are using “AI for AI” to solve the complexities of serving at scale.
Inference Gateway’s new predictive latency boost replaces heuristic guesswork with machine learning-driven, real-time capacity-aware routing. This intelligent orchestration cuts time-to-first-token (TTFT) latency by more than 70% without manual tuning. For businesses, this translates directly into more natural voice conversations and smooth, real-time interactions across a range of use cases.
Inference Gateway can be deployed alongside llm-d, a Kubernetes-native high-performance distributed LLM inference framework, which was recently accepted as a Cloud Native Computing Foundation (CNCF) Sandbox project. Google Cloud is proud to be a founding contributor to llm-d alongside Red Hat, IBM Research, CoreWeave, and NVIDIA, uniting around a clear, industry-defining vision: any model, any accelerator, any cloud.
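Because llm-d and most modern serving stacks expose OpenAI-compatible endpoints, a client calling a model behind the Inference Gateway can be as simple as the following sketch. The gateway URL and model name are placeholders.

```python
# Stream a chat completion from an OpenAI-compatible endpoint exposed
# behind the gateway; the base_url and model name are placeholders.
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="http://inference-gateway.example.internal/v1",  # placeholder
    api_key="unused",  # many self-hosted endpoints ignore the key
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize today's on-call alerts."}],
    stream=True,  # stream tokens so TTFT, not total latency, is what users feel
)
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```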