Monitoring your investment
You can closely monitor your Provisioned Throughput usage using Cloud Monitoring metrics on the aiplatform.googleapis.com/PublisherModel resource. Key metrics include:
- /dedicated_gsu_limit: Your dedicated limit in Generative Scale Units (GSUs).
- /consumed_token_throughput: Your actual throughput usage, accounting for the model’s burndown rate.
- /dedicated_token_limit: Your dedicated limit measured in tokens per second.
This allows you to verify that you are getting the value you paid for and helps you right-size your commitment over time. To learn more about Provisioned Throughput on Vertex AI, see the Provisioned Throughput guide in the official documentation.
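As a concrete starting point, the metrics above can be queried through the Cloud Monitoring API with a standard time-series filter. The sketch below builds such a filter and computes a simple utilization ratio from the consumed and dedicated token-throughput metrics; the metric-type prefix and the `model_user_id` resource label are assumptions, so verify the exact names in the Cloud Monitoring metrics list before relying on them.

```python
# Sketch: build Cloud Monitoring filters for Provisioned Throughput metrics
# and compute a utilization ratio. The METRIC_PREFIX and the resource label
# "model_user_id" are assumptions -- confirm them against the published
# metrics list for aiplatform.googleapis.com.

METRIC_PREFIX = "aiplatform.googleapis.com/publisher/online_serving"

def pt_metric_filter(metric: str, model_id: str) -> str:
    """Return a Monitoring API time-series filter for one PT metric."""
    return (
        f'metric.type = "{METRIC_PREFIX}/{metric}" AND '
        f'resource.type = "aiplatform.googleapis.com/PublisherModel" AND '
        f'resource.labels.model_user_id = "{model_id}"'
    )

def pt_utilization(consumed_tps: float, dedicated_limit_tps: float) -> float:
    """Fraction of the dedicated tokens-per-second limit currently in use."""
    if dedicated_limit_tps <= 0:
        return 0.0
    return consumed_tps / dedicated_limit_tps
```

Plotting `pt_utilization` over time is a quick way to see whether your GSU commitment is consistently under- or over-subscribed before you resize it.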
Building your recipe: Combining options for optimal results
Consider a workload with a predictable daily baseline, expected peaks, and the occasional unexpected spike. The optimal recipe would be:
- Provisioned Throughput: Cover your predictable, mission-critical base load. This gives you an availability SLA for the core of your application.
- Priority PayGo: Use this to handle predictable peaks that rise above your PT commitment, or for important traffic that is less frequent. This acts as a cost-effective insurance policy against 429 errors for your most important variable traffic.
- Standard PayGo (within tier limit): This forms your foundation for general, non-critical traffic that fits comfortably within your organization’s usage tier.
- Standard PayGo (opportunistic bursting): For non-critical, latency-insensitive jobs (like batch processing), you can rely on the best-effort bursting of the standard PayGo model. If some of these requests are throttled, it won’t impact your core user experience, and you don’t pay a premium for them.
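The recipe above can be sketched as a simple tiering decision. The tier names and decision inputs below are illustrative only, not an API: in practice the tier is selected per request through the mechanism your product surface exposes, and your own capacity signal would replace the boolean flags.

```python
# Sketch of the "recipe" as a routing decision. Tier names and inputs are
# illustrative assumptions, not real API values.

def choose_tier(is_critical: bool, pt_has_capacity: bool,
                latency_sensitive: bool) -> str:
    if is_critical and pt_has_capacity:
        return "provisioned_throughput"  # SLA-backed baseline
    if is_critical:
        return "priority_paygo"          # predictable peaks above the PT commitment
    if latency_sensitive:
        return "standard_paygo"          # general traffic within the usage tier
    return "standard_paygo_burst"        # best-effort bursting; throttling is acceptable
```

The key design point is the order of the checks: mission-critical traffic always lands on capacity with an SLA (PT first, Priority PayGo as overflow), while everything else falls through to the cheaper best-effort tiers.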
By understanding and combining these powerful tools, you can move beyond simply managing costs and start truly optimizing your GenAI strategy for the perfect balance of performance, availability, and value.
Extra bonus: Batch API and Flex PayGo
First, the Batch API: not every LLM request needs a sub-second time-to-first-token (TTFT). If a user is chatting with a customer service bot, low latency is critical. But if you are classifying millions of support tickets from last month, running evaluations, or generating daily summary reports, nobody is sitting at a screen waiting for a real-time stream. This is where the Gemini Batch API becomes your best friend. You bundle a large payload of requests into a single file and submit it asynchronously; the infrastructure processes the workload during off-peak windows or when idle compute capacity is available. The target turnaround time is 24 hours, though in practice it is typically much faster. By trading immediate execution for asynchronous processing, you get a 50% discount on standard token costs.
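Bundling requests into that single file can be sketched as building a JSONL payload, one request per line. The `{"request": ...}` envelope below mirrors the batch prediction input format for Gemini on Vertex AI, but treat the exact schema as an assumption and check the Batch API documentation before submitting real jobs.

```python
import json

# Sketch: assemble prompts into a JSONL batch-input payload, one request
# per line. The envelope/schema is an assumption -- verify it against the
# Vertex AI Batch API documentation.

def build_batch_jsonl(prompts: list[str]) -> str:
    lines = []
    for p in prompts:
        lines.append(json.dumps({
            "request": {
                "contents": [{"role": "user", "parts": [{"text": p}]}]
            }
        }))
    return "\n".join(lines)
```

You would then upload the resulting file (for example, to Cloud Storage) and point the batch job at it; the per-line structure is what lets the service fan the work out across idle capacity.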
While Batch handles your offline heavy lifting, your live apps still need real-time computation. But not every request is latency-critical, and many customers are happy to wait a little longer in exchange for a discount on standard token costs. Flex PayGo provides a highly cost-effective way to access Gemini models, offering a 50% discount compared to Standard PayGo. Optimized for non-critical workloads that can tolerate response times of up to 30 minutes, it allows seamless transitions between Provisioned Throughput (PT), Standard PayGo, and Flex PayGo with minimal code changes. Ideal use cases include:
- Offline analysis of text and multimodal files.
- Model quality evaluation and benchmarking.
- Data annotation and labeling.
- Automated product catalog generation.
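To make the trade-off concrete, here is a back-of-the-envelope cost comparison using only the discounts stated above (Batch and Flex at roughly 50% of Standard PayGo). The per-token rate is a placeholder, not a real price; use the Vertex AI pricing page for actual numbers.

```python
# Sketch: rough cost comparison across modes, using the ~50% Batch/Flex
# discounts described above. STANDARD_RATE_PER_1M is a placeholder value,
# not a published price.

STANDARD_RATE_PER_1M = 1.00  # placeholder $ per 1M tokens

DISCOUNT_MULTIPLIER = {"standard": 1.0, "flex": 0.5, "batch": 0.5}

def estimated_cost(tokens: int, mode: str) -> float:
    """Estimated spend for a token volume under one consumption mode."""
    return tokens / 1_000_000 * STANDARD_RATE_PER_1M * DISCOUNT_MULTIPLIER[mode]
```

At the same token volume, Flex and Batch land at the same price point in this model; the choice between them comes down to whether the workload needs a live (if slower) response or can run fully offline.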
Get started
- Explore the models in Vertex AI: Discover the full range of Google’s first-party models, as well as over 100 open-source models, in the Model Garden.
- Dive deeper into the documentation: For the most up-to-date technical details, thresholds, and code samples, the official Vertex AI documentation is your source of truth.
- Review pricing details: Get a detailed breakdown of token costs, Provisioned Throughput pricing, and the latest discounts for Batch and Flex APIs on the Vertex AI pricing page.