With this release, the system uses Kubernetes Custom Resources to manage your distributed inference service. InferencePool resources in each “target cluster” group model-server backends. These backends are exported and become visible as GCPInferencePoolImport resources in the “config cluster.” Standard Gateway and HTTPRoute resources in the config cluster define the entry point and routing rules, directing traffic to these imported pools. Fine-grained load-balancing behavior, such as balancing on CUSTOM_METRICS or IN_FLIGHT requests, is configured with a GCPBackendPolicy resource attached to the GCPInferencePoolImport.
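To make the wiring concrete, here is a minimal sketch of the config-cluster side of this setup. This is illustrative only: the gateway class name, resource names, API groups for the import resource, and the policy field names below are assumptions based on standard Gateway API and GKE policy conventions; consult the GKE Inference Gateway documentation for the exact schema.

```yaml
# Illustrative sketch -- names and several field paths are assumptions,
# not the authoritative schema.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway        # hypothetical name
  namespace: config
spec:
  gatewayClassName: gke-inference-gateway   # placeholder; use your cluster's class
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route                # hypothetical name
  namespace: config
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - backendRefs:
    # Route to the pool imported from the target clusters.
    - group: networking.gke.io   # assumed group for GCPInferencePoolImport
      kind: GCPInferencePoolImport
      name: vllm-llama-pool      # hypothetical imported pool name
---
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: llm-lb-policy            # hypothetical name
  namespace: config
spec:
  targetRef:
    group: networking.gke.io
    kind: GCPInferencePoolImport
    name: vllm-llama-pool
  # Field layout below is a guess at where CUSTOM_METRICS / IN_FLIGHT
  # balancing would be expressed; check the docs for the real fields.
  default:
    balancingMode: CUSTOM_METRICS
```

The key idea is that the HTTPRoute's backendRef points at the imported pool rather than a Service, and the policy attaches to that same import to shape how traffic is spread across clusters.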
This architecture enables use cases like global low-latency serving, disaster recovery, capacity bursting, and efficient use of heterogeneous hardware.
For more information about GKE Inference Gateway core concepts, check out our guide.
Get started today
As you scale your AI inference serving workloads to more users in more places, we’re excited for you to try multi-cluster GKE Inference Gateway. To learn more and get started, check out the documentation.