Scaling LLM Inference: Multi-Node KV Cache Offloading with GKE & Managed Lustre

5. Deploy the PVC Evictor

PVC Evictor Overview

Architecture & Role

The llmd_fs_backend connector offloads KV-cache blocks to Lustre but never deletes old cache files on its own, so left unchecked the cache will eventually fill the shared filesystem. The PVC Evictor acts as an external garbage collector: it continuously monitors disk usage and evicts least-recently-used (LRU) files to keep enough free space on the filesystem.
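The LRU policy described above can be illustrated with a minimal sketch. This is not the PVC Evictor's actual implementation; the function names (scan_cache, pick_lru_victims, evict) are hypothetical, and it assumes that file access time (atime) is a usable recency signal on the mounted filesystem, which depends on mount options.

```python
import os
from typing import List, Tuple

# (path, last-access time in seconds, size in bytes)
FileInfo = Tuple[str, float, int]


def scan_cache(root: str) -> List[FileInfo]:
    """Walk the cache directory and collect access time and size per file."""
    files: List[FileInfo] = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except FileNotFoundError:
                # The file vanished between listing and stat; skip it.
                continue
            files.append((full, st.st_atime, st.st_size))
    return files


def pick_lru_victims(files: List[FileInfo], bytes_to_free: int) -> List[str]:
    """Choose the least recently accessed files until enough bytes are covered."""
    victims: List[str] = []
    freed = 0
    # Oldest access time first (assumption: atime reflects last use).
    for path, _atime, size in sorted(files, key=lambda f: f[1]):
        if freed >= bytes_to_free:
            break
        victims.append(path)
        freed += size
    return victims


def evict(root: str, bytes_to_free: int) -> List[str]:
    """Delete LRU files under root and return the paths actually removed."""
    removed: List[str] = []
    for path in pick_lru_victims(scan_cache(root), bytes_to_free):
        try:
            os.remove(path)
            removed.append(path)
        except FileNotFoundError:
            # Another process already removed it; nothing to do.
            pass
    return removed
```

A real evictor would call something like evict only when disk usage crosses a threshold, and would stop once usage falls back below a lower target, rather than freeing a fixed byte count.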

Scaling & Sharding

The PVC Evictor supports sharding and can be scaled to multiple replicas to match the capacity and performance of your Lustre instance. As a rule of thumb, deploy one evictor replica per 72 TB of Lustre capacity. This spreads the eviction load across replicas without overwhelming the metadata servers.
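As a quick illustration of the sizing rule of thumb, the replica count can be computed from the Lustre capacity. The helper name evictor_replicas is hypothetical, and rounding up to the next whole replica is an assumption; the rule itself only states one replica per 72 TB.

```python
import math

TB_PER_REPLICA = 72  # rule of thumb: one evictor replica per 72 TB


def evictor_replicas(lustre_capacity_tb: float) -> int:
    """Return a suggested evictor replica count for a given capacity in TB."""
    if lustre_capacity_tb <= 0:
        raise ValueError("capacity must be positive")
    # Assumption: round up so a partially filled 72 TB slice still gets a replica.
    return math.ceil(lustre_capacity_tb / TB_PER_REPLICA)


if __name__ == "__main__":
    for capacity in (36, 72, 144, 288):
        print(f"{capacity} TB -> {evictor_replicas(capacity)} replica(s)")
```

Under these assumptions, a 36 TB or 72 TB instance maps to one replica, 144 TB to two, and 288 TB to four.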

For large-scale deployments, the evictor can be configured to run with multiple shards. In multi-replica mode, the workload is partitioned across pods, and each pod manages a specific shard of the cache namespace. Because no two pods scan or delete the same files, this avoids redundant metadata scans and race conditions.
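One common way to partition a namespace so that each file has exactly one owner is to hash the path and take the result modulo the shard count. The sketch below shows that general technique; it is an assumption for illustration, not the evictor's documented partitioning scheme, and shard_for and owned_by are hypothetical names.

```python
import hashlib


def shard_for(path: str, num_shards: int) -> int:
    """Map a cache path to a shard index in [0, num_shards)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # A stable hash (unlike Python's built-in hash()) gives the same answer
    # in every pod and across restarts.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def owned_by(path: str, shard_index: int, num_shards: int) -> bool:
    """Return True if the pod with this shard index should handle the path."""
    return shard_for(path, num_shards) == shard_index


if __name__ == "__main__":
    # Hypothetical cache paths, used only to show the assignment.
    paths = [f"/mnt/lustre/kv-cache/block-{i:04d}" for i in range(8)]
    for p in paths:
        print(p, "-> shard", shard_for(p, 4))
```

Since every pod computes the same mapping, each file is scanned and evicted by a single pod, which is what prevents duplicate work and competing deletes.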

High-Performance Resource Requirements

Running the evictor at high scale (e.g., with 16 parallel crawler processes) requires significant CPU and memory to scan millions of files quickly and manage the resulting eviction queues. Ensure that each pod is provisioned with sufficient resources (e.g., requests of 12 CPUs and 8Gi of memory) and scheduled on an appropriate node type (such as c4-standard-16).

PVC Evictor Deployment Steps

The PVC Evictor is deployed via Helm using the chart located in kv_connectors/pvc_evictor/helm.

Step 5a: Create a Dedicated Node Pool for the Evictor

Because each evictor pod requests 12 CPUs and 8Gi of memory at high scale, first create a dedicated node pool with a high-performance machine type (such as c4-standard-16) that can accommodate those requests.
