When AI agents make thousands of decisions a day, consistent performance isn’t just a technical detail — it’s a business requirement.
Provisioned Throughput (PT) solves this by giving you reserved resources that guarantee capacity and predictable performance. To help you scale, we are updating PT on Vertex AI with three key improvements:
- Model diversity: Run the right model for the right job.
- Multimodal innovation: Process text, images, and video seamlessly.
- Operational flexibility: Adapt your resources as your agents grow.
In this post, we’ll share the resources available to you today on Vertex AI, and how you can get started.
Expanding support for a diverse model portfolio
A mature AI strategy requires selecting the right model for the specific task. Vertex AI Model Garden, our curated set of 200+ first-party, third-party, and open-source models, makes it easy to choose the best model for your business needs.
We have standardized the PT experience across this portfolio to ensure your capacity strategy remains consistent regardless of the model you deploy.
- Anthropic integration (private preview): You can now purchase and manage PT for Anthropic models directly from the Vertex AI console, bringing one of the industry’s leading third-party providers into your primary capacity workflow.
- Open model ecosystem: We have extended PT support to the most popular open-source models, including Llama 4, Qwen3, GLM-4.7, and DeepSeek-OCR, all from the same console experience.
- Unified governance: Because PT now covers all types of models under a single framework, engineering teams no longer need to design separate reservation or procurement strategies for different model providers.
Powering multimodal innovation
The next wave of AI agents is seeing, hearing, and acting in real time. This movement toward native audio, high-definition video, and complex reasoning creates a massive demand for reliable compute.
We are ensuring that PT supports these advanced modalities as soon as they reach your production environment.
- Gemini 3 and Nano Banana: You can now secure dedicated PT for our most capable Gemini 3 models and Nano Banana, our state-of-the-art model for high-fidelity image generation and editing.
- Gemini Live API: By using PT with Gemini Live API, you get the guaranteed throughput required for high-bandwidth multimodal streams – whether your agents are processing live video feeds or providing real-time audio responses.
- Veo 3 and 3.1: For video workloads, PT GSU (Generative AI Scale Unit) minimums and incremental limits have been removed for Veo 3 and Veo 3.1. This allows you to purchase the exact amount of capacity you need, making it easier to scale video generation without being forced into high entry-level commitments.
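With minimums and increments removed, capacity can be sized directly from expected peak demand. A minimal sketch of that sizing arithmetic follows; the per-GSU throughput figure is a hypothetical placeholder, not a published rate – check the Vertex AI PT documentation for actual per-model throughput values:

```python
import math

def gsus_needed(peak_demand_per_sec: float, throughput_per_gsu: float) -> int:
    """Round peak demand up to whole GSUs, since GSUs are purchased in
    integer units. All throughput figures here are placeholders."""
    if peak_demand_per_sec <= 0:
        return 0
    return math.ceil(peak_demand_per_sec / throughput_per_gsu)

# Hypothetical example: 25 units/sec of peak video-generation demand,
# assuming each GSU sustains 10 units/sec (placeholder figure).
print(gsus_needed(25, 10))  # 3 GSUs – no minimum-commitment floor applies
```

The point of the change is the rounding behavior: you pay for the ceiling of your actual demand, not for an entry-level block that may far exceed it.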
Increasing operational flexibility
Scaling for global production shouldn’t mean sacrificing agility. We provide levers to treat AI compute as a dynamic resource that aligns with actual business cycles.
- Flexible term lengths: We now offer 1-week PT terms for select models. This allows you to secure guaranteed capacity for high-impact, short-term windows – like a holiday traffic spike or a product launch – without a monthly or yearly commitment.
- Proactive capacity planning: You can now schedule change orders for your PT requests up to two weeks in advance for select models. This enables your team to automate the ramp-up of resources for known peak events, shifting your strategy from reactive scaling to proactive planning.
- Maximizing token value: For agentic workloads with long, repetitive contexts, PT now integrates with explicit caching for select models. This delivers reserved performance alongside the significant input cost reductions of caching, ensuring the price of your reservation aligns with actual business value.
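To see why caching matters for the economics of agentic workloads, consider a rough blended-cost model. The rate and discount below are hypothetical placeholders for illustration only, not published Vertex AI pricing:

```python
def effective_input_cost(total_tokens: int, cached_fraction: float,
                         rate_per_1k: float, cache_discount: float) -> float:
    """Blend full-rate fresh tokens with discounted cached tokens.
    All rates and discounts are hypothetical placeholders."""
    cached = total_tokens * cached_fraction
    fresh = total_tokens - cached
    cost = fresh * rate_per_1k + cached * rate_per_1k * (1 - cache_discount)
    return cost / 1000

# Hypothetical agent: 100k input tokens per request, 80% of which is a
# repeated context served from cache, at $0.50 per 1k tokens with a 75%
# discount on cached tokens (placeholder figures).
with_cache = effective_input_cost(100_000, 0.8, 0.50, 0.75)
without_cache = effective_input_cost(100_000, 0.0, 0.50, 0.75)
print(with_cache, without_cache)  # 20.0 vs 50.0 per request
```

Under these placeholder figures, the more of the context that repeats across requests, the larger the share of input tokens billed at the discounted cached rate – which is why the pairing with PT matters for long, repetitive agent contexts.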