What’s new in GKE at Next ‘26

April 23, 2026

This week at Google Cloud Next ‘26, we are sharing the evolution of Google Kubernetes Engine (GKE), delivering leading performance, efficiency, security, and scale for your most demanding and complex workloads, and the next generation of AI and agentic applications.

Why it matters: Kubernetes has rapidly become the operating system for the AI era, with GKE now powering AI workloads for all of our top 50 customers on the platform, including the largest frontier model builders. We are witnessing a massive acceleration in enterprise AI. In just a few months, the number of multi-agent AI workflows has surged by 327%. At the same time, 66% of organizations rely on Kubernetes to power generative AI apps and agents.

This new era of autonomous agents operating at massive scale requires a foundational change in how we manage infrastructure — a change that is more demanding than the shift from stateless to stateful applications. 

What’s new: 

  • GKE Agent Sandbox: Secure, highly scalable, low-latency agent infrastructure

  • GKE hypercluster: A single, conformant GKE control plane to manage millions of accelerators across Google Cloud regions

  • Improved inference performance: Foundational enhancements to GKE Inference Gateway and KV Cache management

  • Reinforcement learning (RL) enhancers: Native capabilities to relieve bottlenecks that throttle accelerator utilization 

  • Scaling on custom metrics: Support for intent-based autoscaling on triggers besides CPU and memory

Read on for details about these GKE announcements.

GKE Agent Sandbox: Accelerating the agentic era

As AI evolves from simple conversational chatbots to entire ecosystems of proactive, autonomous agents, the underlying infrastructure must adapt to handle hundreds or thousands of agents collaborating with workers to plan, evaluate, and execute complex tasks. At scale, infrastructure performance, responsiveness, and rigorous security are essential. 

We are excited to announce GKE Agent Sandbox, the industry’s most scalable and low-latency agent infrastructure. Built with gVisor kernel isolation — the same technology securing Gemini — Agent Sandbox allows you to safely execute untrusted code, tools, and entire agents without sacrificing performance. GKE provides leading speed and efficiency for fully isolated agents with 300 sandboxes per second at sub-second latency and up to 30% better price-performance when running on Axion compared to other hyperscale clouds.

Lovable empowers anyone to build apps and websites — with builders creating 200,000+ new projects daily. Lovable runs these AI-generated applications in GKE Agent Sandboxes because of their fast startup, fast scaling, and secure isolation.

“GKE’s cutting-edge sandboxing capabilities allow us to reliably scale to hundreds of secure sandboxes per second, ensuring we can seamlessly empower builders, even during massive, unpredictable demand.” – Fabian Hedin, Co-founder, Lovable
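The post doesn’t show Agent Sandbox’s own API, but GKE already exposes gVisor isolation per Pod through GKE Sandbox and its `gvisor` RuntimeClass. As a rough sketch of how an untrusted agent workload runs under kernel-level isolation today (the image name and resource values are illustrative, and the dedicated Agent Sandbox product may expose a higher-level API):

```yaml
# Sketch: running untrusted agent code under gVisor isolation on GKE.
# Assumes a node pool with GKE Sandbox enabled; the image and resource
# values below are hypothetical, not from the announcement.
apiVersion: v1
kind: Pod
metadata:
  name: agent-sandbox-demo
spec:
  runtimeClassName: gvisor   # GKE Sandbox: user-space kernel isolation via gVisor
  containers:
  - name: agent
    image: us-docker.pkg.dev/example/agents/agent-runner:latest  # hypothetical image
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
```

Setting `runtimeClassName` is the only change to the Pod spec; scheduling, networking, and scaling work as for any other GKE workload.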

GKE hypercluster redefines the scalability ceiling 

As foundational AI models grow exponentially and accelerators remain in high demand, organizations resort to fracturing Kubernetes compute infrastructure into hundreds of disconnected clusters, which can create a massive operational burden. To help, we’re announcing the private GA of GKE hypercluster, which allows a single, Kubernetes-conformant GKE control plane to manage a million chips distributed across 256,000 nodes — spanning multiple Google Cloud regions. With GKE hypercluster, widely distributed infrastructure becomes a single, unified capacity reserve that spans geographical locations.

To scale globally without compromising security, GKE hypercluster relies on Google’s Titanium Intelligence Enclave, a software-hardened security engine that delivers private AI compute. This “no-admin-access” model provides hardware-attested, pod-level isolation, so that proprietary model weights and prompts remain cryptographically sealed from platform administrators and infrastructure layers.

Supercharging state-of-the-art inference

Achieving frontier inference requires months of complex performance tuning. To reduce this heavy lifting, GKE now slashes your “time to SOTA” across TPUs and GPUs to mere minutes. We do this with new capabilities:

  • ML-driven Predictive Latency Boost in GKE Inference Gateway, which can reduce time-to-first-token latency by up to 70% by replacing heuristic guesswork with real-time capacity-aware routing — no manual tuning required.

  • Automatic KV Cache storage tiering across RAM, Local SSD, and GCS/Lustre solves long-context memory bottlenecks. Offloading KV Cache to RAM yielded a more than 40% TTFT reduction and a 50% throughput gain for a 10K system prompt length. Offloading KV Cache to Local SSD yielded an almost 70% throughput improvement for a 50K system prompt length. Learn more about these benchmarks in the llm-d Offloading Prefix Cache to Shared Storage guide.

Built as part of a layered composable suite, these new GKE capabilities leverage llm-d, now an official CNCF Sandbox project. To give you maximum flexibility, we’ve partnered closely with NVIDIA to seamlessly integrate Dynamo for scaling massive Mixture-of-Experts (MoE) models. Whichever tools you choose, GKE provides the highly-optimized, flexible infrastructure you need to safely run any frontier AI workload — including the advanced agentic capabilities of the newly announced Gemma 4.
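GKE Inference Gateway builds on the open-source Gateway API Inference Extension, where model servers are grouped into an `InferencePool` that an endpoint picker routes into. As a hedged sketch of that shape (the CRD group/version and field names below follow the upstream project and are assumptions; GKE’s managed offering may differ):

```yaml
# Sketch of model-aware routing via the Gateway API Inference Extension.
# Names, group/version, and fields are assumptions based on the upstream
# project, not taken from this announcement.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: gemma-pool
spec:
  selector:
    app: vllm-gemma              # Pods serving the model (hypothetical label)
  targetPortNumber: 8000
  extensionRef:
    name: gemma-endpoint-picker  # endpoint picker doing capacity-aware routing
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway      # hypothetical Gateway name
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: gemma-pool
```

The key design point is that the `HTTPRoute` targets the pool rather than a plain Service, letting the routing layer pick endpoints by real-time serving capacity instead of round-robin.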

Eliminating RL compute bottlenecks

Reinforcement learning (RL) is a key driver of AI compute demand, and RL jobs involve sequential processing for sampling, reward, and training that can leave GPU and TPU accelerators idle between steps. To streamline RL, we are adding new GKE capabilities in preview:

  • RL Scheduler mitigates the “straggler effect” and inter-batch tail latency, maximizing throughput via intelligent routing.

  • RL Sandbox provides kernel-level isolation for tool-calling and reward evaluation with millisecond-scale provisioning, and integrates easily with RL sampling and reward steps.

  • RL Observability and Reliability dashboards offer the deep visibility required to troubleshoot and optimize the entire RL loop instantly, out of the box.

Review the RL on GKE recipe, specifically the implementations for Verl and NeMo RL.

Intent-based autoscaling on custom metrics

Traditionally, scaling AI workloads based on application health has imposed a “custom metric tax.” To scale the system on anything but basic compute or memory utilization, organizations have to manage complex monitoring systems and IAM roles. This creates operational risk: if your external observability stack fails, your autoscaling breaks along with it.

Intent-based autoscaling eliminates this overhead via native custom metrics support for GKE’s Horizontal Pod Autoscaler (HPA). This agentless architecture bypasses external dependencies by sourcing metrics directly from Pods, hardening reliability while cutting costs. Crucially, it drops reaction times from 25 seconds to just 5 seconds — a 5x performance gain for near-instantaneous infrastructure elasticity.
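On the manifest side, scaling on a custom per-Pod metric uses the standard `autoscaling/v2` HPA shape. A minimal sketch, assuming a hypothetical `inflight_requests` metric exposed by each model-server Pod (how GKE sources it natively is described above, not in this manifest):

```yaml
# Sketch: HPA scaling on a custom per-Pod metric instead of CPU/memory.
# The metric name, target value, and Deployment name are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server           # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: inflight_requests  # hypothetical custom metric from each Pod
      target:
        type: AverageValue
        averageValue: "10"       # scale to hold ~10 in-flight requests per Pod
```

With a `Pods`-type metric and an `AverageValue` target, the HPA adds replicas whenever the per-Pod average drifts above the target, which tracks application load far more directly than CPU utilization does for inference servers.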

New workloads, same mission

For over a decade, GKE has set the standard for scalable infrastructure. As we enter the era of agentic and autonomous AI, our mission remains the same: eliminating operational friction so you can focus on innovation. The capabilities we are announcing at Next ‘26 — from GKE hypercluster and the Agent Sandbox, to ultra-fast inference and intent-based autoscaling — give you the secure, efficient, and powerful engine you need to succeed with your ambitious AI workloads. To learn more about using GKE for your AI workloads, check out GKE Inference Quickstart.
