Additionally, when working with LLMs, you may often encounter issues from the underlying APIs such as rate-limiting or downtime. As you move your LLM applications into production it becomes more and more important to safeguard against these. That’s why LangChain introduced the concept of a fallback, an alternative plan that may be used in an emergency. A fallback can be to a different model or even to another LLM provider altogether. Fallbacks can be implemented in code along with backoff and retry methods for greater resilience of your LLM applications.
Another robust option for LLM resiliency is circuit breaking with Apigee. By placing Apigee between a retrieval-augmented generation (RAG) application and LLM endpoints, you can manage traffic distribution and graceful failure handling. Of course, each model will provide a different answer so fallbacks and circuit breaking architecture should be thoroughly tested to ensure it meets your users needs.
Dynamic shared quota
Dynamic shared quota is one way that Google Cloud manages resource allocation for certain models, aiming to provide a more flexible and efficient user experience. Here’s how it works:
Traditional quota vs. dynamic shared quota
- Traditional quota: In a traditional quota system, you’re assigned a fixed limit for a specific resource (e.g., API requests per day, per minute, per region). If you need more capacity, you usually have to submit a quota increase request and wait for approval. This can be slow and inconvenient. Of course, simply having quota allocated does not guarantee capacity, as it is still on-demand and not dedicated capacity. We will talk more about dedicated capacity when we discuss Provisioned Throughput later in this blog.
- Dynamic shared quota: With dynamic shared quota, Google Cloud has a pool of available capacity for a service. This capacity is dynamically distributed among all users who are making requests. Instead of having a fixed individual limit, you draw from this shared pool based on your needs at any given moment.
Benefits of dynamic shared quota
- Eliminates quota increase requests: You no longer need to submit quota increase requests for services that use dynamic shared quota. The system automatically adjusts to your usage patterns.
- Improved efficiency: Resources are used more efficiently because the system can allocate capacity where it’s needed most at any given time.
- Reduced latency: By dynamically allocating resources, Google Cloud can minimize latency and provide faster responses to your requests.
- Simplified management: It simplifies capacity planning because you don’t have to worry about hitting fixed limits.
Dynamic shared quota in action
429 resource exhaustion errors are more likely to occur with asynchronous calls to Gemini with large multimodal input such as large video files. Below is a comparison of model performance of Gemini-1.5-pro-001 with traditional quota versus Gemini-1.5-pro-002 with dynamic shared quota. We can see even without retry (not recommended) the second-generation Gemini Pro model outperforms the previous-generation model because of dynamic shared quota.