We go deep into the fabric of the AI cluster. We’ll discuss why Ethernet has become the definitive backplane for AI workloads. We’ll explore hardware innovations in power efficiency and the protocol optimizations, such as Dynamic Load Balancing (DLB) and advanced congestion control, that keep data moving at the speed of thought. This section surveys the different networks used for AI, from scale-up to scale-out and scale-across, and discusses optimizations and enhancements to Ethernet standards, such as UEC and ESUN, for AI applications.
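To make the load-balancing discussion concrete, here is a minimal sketch of the idea behind flowlet-based dynamic load balancing, contrasted with static ECMP hashing. The class name, link-load model, and flowlet-boundary signal are illustrative assumptions, not any vendor's actual implementation:

```python
import hashlib

class DynamicLoadBalancer:
    """Toy model of dynamic load balancing (DLB) across ECMP uplinks.

    Static ECMP hashes a flow's 5-tuple to one uplink forever; DLB
    instead re-evaluates link load at flowlet boundaries (idle gaps in
    a flow), so large flows can move off congested links without
    reordering packets within a flowlet. All numbers are illustrative.
    """

    def __init__(self, num_links: int):
        self.load = [0] * num_links   # bytes recently sent per uplink
        self.flowlet_link = {}        # active flowlet -> chosen uplink

    def static_ecmp(self, five_tuple: str) -> int:
        # Classic ECMP: a stable hash of the 5-tuple picks the uplink.
        digest = hashlib.sha256(five_tuple.encode()).digest()
        return digest[0] % len(self.load)

    def pick_link(self, five_tuple: str, new_flowlet: bool) -> int:
        if new_flowlet or five_tuple not in self.flowlet_link:
            # At a flowlet boundary, choose the least-loaded uplink.
            self.flowlet_link[five_tuple] = self.load.index(min(self.load))
        return self.flowlet_link[five_tuple]

    def send(self, five_tuple: str, size: int, new_flowlet: bool = False) -> int:
        link = self.pick_link(five_tuple, new_flowlet)
        self.load[link] += size
        return link
```

A heavy flow pinned to one link by static hashing would congest it; here, a new flowlet from the same flow can be steered to whichever uplink is currently least loaded.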
Tom Emmons emphasizes that as AI networks become business-critical, quality and power efficiency are the primary drivers of architectural decisions. Every problem in an AI network escalates immediately because of the massive financial investments involved, making a reliable network essential. Since power is the fundamental limiting factor for GPU density in a data center, Arista focuses on reducing the network power footprint, ideally to less than 10% of total facility power, through liquid cooling, low-power optics, and high-radix switches that minimize the number of tiers. By reducing tiers, operators save on optics, which are the largest contributors to network power consumption, while also simplifying load balancing and reducing potential congestion points.
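The radix-versus-tiers trade-off can be sketched with some back-of-the-envelope arithmetic. The model below is a simplifying assumption (a folded Clos with half of each switch's ports facing down and a non-blocking design), not Arista's sizing methodology, but it shows why higher-radix switches mean fewer tiers and fewer optics-bearing inter-switch links:

```python
def clos_tiers(endpoints: int, radix: int) -> int:
    """Minimum switching tiers in an idealized folded Clos fabric.

    Assumes each switch below the top tier splits its ports evenly
    between downlinks and uplinks, so every added tier multiplies
    reach by radix // 2. Illustrative only; real designs also weigh
    oversubscription, resilience, and cabling.
    """
    tiers = 1
    reach = radix  # one top-tier switch can attach `radix` endpoints
    while reach < endpoints:
        tiers += 1
        reach *= radix // 2
    return tiers

def inter_switch_links(endpoints: int, tiers: int) -> int:
    """Inter-tier links (each typically an optic at both ends) in a
    non-blocking fabric: full endpoint bandwidth crosses every tier
    boundary once."""
    return endpoints * (tiers - 1)

# For a hypothetical 32k-GPU cluster, a 512-port radix halves the
# inter-switch link count relative to a 64-port radix.
for radix in (64, 512):
    t = clos_tiers(32768, radix)
    print(f"radix {radix}: {t} tiers, {inter_switch_links(32768, t)} inter-switch links")
```

Under these assumptions, the 64-port fabric needs three tiers and 65,536 inter-switch links, while the 512-port fabric needs two tiers and 32,768, which is the power and congestion-point argument in miniature.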
The presentation identifies four distinct AI fabrics: front-end, scale-out, scale-across, and scale-up. While scale-out provides the essential east-west connectivity for GPU training, scale-across is becoming increasingly vital for customers who must link geographically dispersed buildings to overcome local power and space constraints. Scale-across networking leverages Arista’s extensive experience in WAN and routing, utilizing deep buffers, encryption, and traffic engineering to manage latency and protect data. Meanwhile, the front-end network mirrors traditional data center designs but demands higher reliability and security, since it connects billions of dollars of hardware to the outside world and to local storage resources.
Arista is a vocal advocate for Ethernet as the universal backplane, specifically for the emerging scale-up market where GPU-to-GPU memory copies occur. Through leadership in consortiums like the Ultra Ethernet Consortium (UEC) and the Ethernet for Scale-Up Networking (ESUN) workgroup, Arista is refining Ethernet to handle 256-byte cache-line transactions and packet spraying more efficiently. Emmons posits that the dominance of Ethernet is driven by the industry’s desire for multi-vendor ecosystems and a unified management model. By running a single EOS image across all four fabric types, Arista provides a mature, tested software stack that allows operators to use the same BGP stack and telemetry tools regardless of whether they are managing a local scale-up cluster or a global scale-across network.
Personnel: Tom Emmons