Enterprises are rapidly moving AI workloads from experimentation to production on Google Kubernetes Engine (GKE), using its scalability to serve powerful inference endpoints. However, as these models handle increasingly sensitive data, they introduce unique AI-driven attack vectors — from prompt injection to sensitive data leakage — that traditional firewalls aren’t designed to catch.
Prompt injection remains a critical attack vector, and it’s not enough to hope that the model will simply refuse to act on a malicious prompt. The minimum standard for protecting an AI serving system is to fortify the service against adversarial inputs and strictly moderate model outputs.
We also recommend developers use Model Armor, a guardrail service that integrates directly into the network data path with GKE Service Extensions, to implement a hardened, high-performance inference stack on GKE.
The challenge: The black box safety problem
Most large language models (LLMs) come with internal safety training. If you ask a standard model how to perform a malicious act, it will likely refuse. However, solely relying on this internal safety presents three major operational risks:
- Opacity: The refusal logic is baked into the model weights, making it opaque and beyond your direct control.
- Inflexibility: You cannot easily tailor refusal criteria to your specific risk tolerance or regulatory needs.
- Monitoring difficulty: A model’s internal refusal typically returns an HTTP 200 OK response with text saying “I cannot help you.” To a security monitoring system, this looks like a successful transaction, leaving security teams blind to active attacks.
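The monitoring gap is easy to see in code. The sketch below, with a hypothetical `is_alert` helper, shows how a monitor that watches only HTTP status codes treats an in-model refusal as a healthy transaction:

```python
# Sketch of why status-code-only monitoring misses in-model refusals.
# The helper and response shape below are hypothetical, for illustration.

def is_alert(status_code: int) -> bool:
    """A naive monitor that only alerts on non-2xx responses."""
    return status_code >= 400

# A jailbreak attempt that the model refuses still returns 200 OK:
refused = {"status": 200, "body": "I cannot help you with that."}

# The monitor sees a successful transaction, so the attack attempt
# never surfaces on a security dashboard.
assert is_alert(refused["status"]) is False
```

An external guardrail, by contrast, can reject the request with an explicit error status that monitoring systems catch natively.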
The solution: Decoupled security with Model Armor
Model Armor addresses these gaps by acting as an intelligent gatekeeper that inspects traffic before it reaches your model and after the model responds. Because it is integrated at the GKE gateway, it provides protection without requiring changes to your application code.
Key capabilities include:
- Proactive input scrutiny: It detects and blocks prompt injection, jailbreak attempts, and malicious URLs before they waste TPU/GPU cycles.
- Content-aware output moderation: It filters responses for hate speech, dangerous content, and sexually explicit material based on configurable confidence levels.
- DLP integration: It scans outputs for sensitive data (PII) using Google Cloud’s Data Loss Prevention technology, blocking leakage before it reaches the user.
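Applications can also consume these verdicts directly. The sketch below shows how a service might gate a prompt on a Model Armor sanitization result; the response shape (`sanitizationResult.filterMatchState`) is an assumption modeled on the Model Armor API, so verify the exact field names against the current API reference:

```python
# Sketch of gating traffic on a Model Armor sanitize verdict.
# ASSUMPTION: the response carries sanitizationResult.filterMatchState
# with values like "MATCH_FOUND" / "NO_MATCH_FOUND"; confirm against
# the Model Armor API reference before depending on these names.

def should_block(sanitize_response: dict) -> bool:
    """Return True when any Model Armor filter reported a match."""
    result = sanitize_response.get("sanitizationResult", {})
    return result.get("filterMatchState") == "MATCH_FOUND"

# Example verdicts as they might arrive from a sanitize call:
clean = {"sanitizationResult": {"filterMatchState": "NO_MATCH_FOUND"}}
flagged = {"sanitizationResult": {"filterMatchState": "MATCH_FOUND"}}

assert should_block(clean) is False
assert should_block(flagged) is True  # e.g. a jailbreak attempt detected
```

With the gateway integration described above, this decision happens in the data path for you; the explicit call pattern is useful mainly for batch scanning or custom pipelines.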
Architecture: High-performance security on GKE
We can construct a stack that balances security with performance by combining GKE, Model Armor, and high-throughput storage.
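To make the wiring concrete, a Service Extensions traffic extension attached to the inference Gateway might look roughly like the following. This is an illustrative sketch only: the resource names, CEL matcher, and service endpoint are assumptions, and the exact `GCPTrafficExtension` schema should be checked against the Service Extensions documentation.

```yaml
# Hypothetical sketch: routing gateway traffic through Model Armor.
# Field values (gateway name, path match, regional endpoint) are
# illustrative assumptions, not a verified configuration.
apiVersion: networking.gke.io/v1
kind: GCPTrafficExtension
metadata:
  name: model-armor-extension
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway          # the Gateway fronting your model server
  extensionChains:
  - name: model-armor-chain
    matchCondition:
      celExpressions:
      - celMatcher: request.path.startsWith("/v1/chat")
    extensions:
    - name: model-armor
      supportedEvents:
      - RequestBody                  # inspect prompts on the way in
      - ResponseBody                 # moderate model output on the way out
      googleAPIServiceName: modelarmor.us-central1.rep.googleapis.com
```

Because the extension sits at the Gateway, the model-serving Deployment itself needs no sidecars or code changes.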