This requires an exponential increase in network bandwidth, with strict bounds on delay (e.g., tail latency), to accommodate the distinctive traffic patterns of AI workloads: sensitivity to performance variation and synchronized bursts, i.e., intense, coordinated, millisecond-level traffic spikes. Furthermore, because large-scale training jobs are especially vulnerable to failures and performance stragglers, high reliability and predictable performance are essential.
To deliver the scale, low latency, and high predictability that modern AI workloads require, as well as protection from extreme bursts, we’ve adopted a “campus as a computer” philosophy, decoupling our network into three distinct domains:
- a scale-up domain for intra-pod connectivity
- a dedicated east-west scale-out accelerator fabric
- the Jupiter frontend network for north-south compute and storage access
This decoupled architecture provides three strategic advantages: it lets each domain evolve independently for faster innovation; it provides a non-blocking scale-out network with massive training bandwidth; and it helps ensure the network can be co-designed in lockstep with new ML accelerators for better hardware support.
Recently, we unveiled Virgo Network, our scale-out data center fabric specifically engineered for modern AI. Virgo utilizes high-radix switches and a flat, two-layer non-blocking topology to provide massive bisection bandwidth, while minimizing latency by reducing network tiers. Its multi-planar design, featuring independent control domains for each plane, provides hardware-level resilience and fault isolation. Furthermore, Virgo can expand across multiple data centers, removing physical building limitations and enabling flexible AI compute scaling.
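
To make the trade-off between switch radix, tier count, and scale concrete, here is a minimal sizing sketch. It assumes a textbook folded-Clos (leaf-spine) fabric in which every switch has the same radix and each leaf splits its ports evenly between endpoints and spines. The function names, the radix value, and the plane count are illustrative assumptions, not published Virgo parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricSize:
    leaves: int     # leaf (bottom-layer) switches
    spines: int     # spine (top-layer) switches
    endpoints: int  # endpoint-facing ports across all leaves


def two_tier_nonblocking(radix: int) -> FabricSize:
    """Largest non-blocking two-tier leaf-spine fabric for a given switch radix.

    Assumption: every leaf uses half its ports as downlinks and half as
    uplinks, with one uplink to each spine.
    """
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be an even integer >= 2")
    downlinks = radix // 2   # endpoint-facing ports per leaf
    spines = radix // 2      # one uplink per spine, so spines == uplinks
    leaves = radix           # each spine port attaches to a distinct leaf
    return FabricSize(leaves=leaves, spines=spines, endpoints=leaves * downlinks)


def three_tier_endpoints(radix: int) -> int:
    """Endpoint ports in a classic three-tier fat-tree (radix**3 / 4), for comparison."""
    return radix ** 3 // 4


def max_switches_on_path(tiers: int) -> int:
    """Worst-case number of switches traversed: up to the top tier and back down."""
    return 2 * tiers - 1


def bandwidth_after_plane_loss(planes: int, failed: int = 1) -> float:
    """Fraction of bandwidth left if traffic is spread evenly over independent planes."""
    if planes < 1 or not 0 <= failed <= planes:
        raise ValueError("need planes >= 1 and 0 <= failed <= planes")
    return (planes - failed) / planes


if __name__ == "__main__":
    print(two_tier_nonblocking(64))
    print(three_tier_endpoints(64), max_switches_on_path(2), max_switches_on_path(3))
    print(bandwidth_after_plane_loss(4))
```

With a hypothetical radix of 64, the sketch gives 2,048 endpoint ports per plane with at most three switches on any path. A three-tier design with the same radix reaches 65,536 ports but can traverse up to five switches. Because two-tier capacity grows with the square of the radix, higher-radix switches are what let a flat, two-layer fabric scale. Under the even-spreading assumption, losing one of four independent planes would leave 75% of the bandwidth available.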






