How Protective ReRoute improves network resilience

Cloud infrastructure reliability is foundational, yet even the most sophisticated global networks can suffer from a critical issue: slow or failed recovery from routing outages. In massive, planetary-scale networks like Google’s, router failures or complex, hidden conditions can prevent traditional routing protocols from restoring service quickly, or sometimes at all. These brief but costly outages — what we call slow convergence or convergence failure — critically disrupt real-time applications with low tolerance to packet loss and, most acutely, today’s massive, sensitive AI/ML training jobs, where a brief network hiccup can waste millions of dollars in compute time.

To solve this problem, we pioneered Protective ReRoute (PRR), a radical shift that moves the responsibility for rapid failure recovery from the centralized network core to the distributed endpoints themselves. Since putting it into production over five years ago, this host-based mechanism has dramatically increased Google’s network’s resilience, proving effective in recovering from up to 84%1 of inter-data-center outages that would have been caused by slow convergence events. Google Cloud customers with workloads that are sensitive to packet loss can also enable it in their environments — read on to learn more.

The limits of in-network recovery

Traditional routing protocols are essential for network operation, but they are often not fast enough to meet the demands of modern, real-time workloads. When a router or link fails, the network must recalculate all affected routes, which is known as reconvergence. In a network the size of Google’s, this process can be complicated by the scale of the topology, leading to delays that range from many seconds to minutes. For distributed AI training jobs with their wide, fan-out communication patterns, even a few seconds of packet loss can lead to application failure and costly restarts. The problem is a matter of scale: as the network grows, the likelihood of these complex failure scenarios increases.

Protective ReRoute: A host-based solution

Protective ReRoute is a simple, effective concept: empower the communicating endpoints (the hosts) to detect a failure and intelligently re-steer traffic to a healthy, parallel path. Instead of waiting for a global network update, PRR capitalizes on the rich path diversity built into our network. The host detects packet loss or high latency on its current path, and then immediately initiates a path change by modifying carefully chosen packet header fields, which tells the network to use an alternate, pre-existing path.

This architecture represents a fundamental shift in network reliability thinking. Traditional networks rely on a combination of parallel and series reliability. Serialization of components tends to reduce the reliability of a system; in a large-diameter network with multiple forwarding stages, reliability degrades as the diameter increases. In other words, every forwarding stage affects the whole system. Even if a network stage is designed with parallel reliability, it creates a serial impact on the overall network while the parallel stage reconverges. By adding PRR at the edges, we treat the network as a highly parallel system of paths that appear as a single stage, where the overall reliability increases as the number of available paths grows exponentially, effectively circumventing the serialization effects of slow network convergence in a large-diameter network. The following diagram contrasts the system reliability model for a PRR-enabled network with that of a traditional network. Traditional network reliability is in inverse proportion to the number of forwarding stages; with PRR the reliability of the same network is in direct proportion to the number of composite paths, which is exponentially proportional to the network diameter.

How Protective ReRoute improves network resilience

How Waze keeps traffic flowing with Memorystore