Ethernet continues to evolve to meet the performance and scaling demands of modern AI networking architectures, progressing from RDMA over Converged Ethernet version 2 (RoCEv2) toward innovations driven by the Ultra Ethernet Consortium (UEC). This presentation discusses these requirements and introduces UEC Specification 1.0, with a focus on scale-out AI designs and the core philosophies shaping its development. It highlights the key Ethernet capabilities defined in UEC 1.0, both those already implemented and those still forthcoming, to show how Ethernet is being optimized for large-scale AI workloads. Alfred Nothaft explains that the primary challenge in AI fabrics is congestion management, particularly during the synchronization phases of training, when thousands of GPUs simultaneously attempt to exchange massive amounts of gradient data. While legacy tools like Explicit Congestion Notification (ECN) and Priority Flow Control (PFC) provide basic notification and pause mechanisms, they are often insufficient for the bursty, highly synchronized traffic of current AI clusters.
The move toward UEC 1.0 represents a fundamental shift from network-centric congestion control to an end-node-centric philosophy. Under the RoCEv2 model, the network infrastructure is largely responsible for managing traffic flows and reacting to congestion. In contrast, UEC shifts the intelligence to the Network Interface Card (NIC) at the GPU endpoint. This allows for more granular, per-packet load balancing rather than traditional flow-based hashing, enabling the NIC to “spray” traffic across multiple paths and dynamically adjust based on real-time telemetry. Furthermore, the UEC transport (UET) is designed to be connectionless and includes native, hardware-level security and encryption from the outset, addressing data sovereignty and privacy concerns that were previously overlooked in backend fabrics.
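The difference between flow-based hashing and per-packet spraying can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the function names, the example 5-tuple, the four-path fabric, and the round-robin spray policy are all hypothetical and are not taken from the UEC specification, which lets the NIC choose paths using real-time telemetry rather than a fixed rotation.

```python
import hashlib
from collections import Counter

def flow_hash_path(five_tuple, num_paths):
    """Flow-based hashing: every packet of a flow maps to the same path."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

def spray_paths(num_packets, num_paths):
    """Per-packet spraying (assumed round-robin): packets rotate over all paths."""
    return [i % num_paths for i in range(num_packets)]

# Hypothetical single large flow between two GPU endpoints.
flow = ("10.0.0.1", "10.0.0.2", 49152, 4791, "UDP")
num_paths, num_packets = 4, 1000

# With flow hashing, the whole flow lands on one path.
hashed = Counter(flow_hash_path(flow, num_paths) for _ in range(num_packets))

# With spraying, the same packets are spread evenly across all paths.
sprayed = Counter(spray_paths(num_packets, num_paths))

print("flow hash:", dict(hashed))
print("sprayed:  ", dict(sprayed))
```

In this model a single large flow occupies exactly one of the four paths under hashing, while spraying places 250 packets on each path. That is the load-balancing benefit the paragraph describes, at the cost of packets arriving out of order at the receiver.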
UEC 1.0 introduces several mechanisms intended to minimize job completion times. These include packet trimming, which reduces a packet to its header during congestion rather than dropping it outright, so the loss is signaled quickly without losing the stream’s context, and advanced in-band telemetry for precise congestion signaling. The specification also features link-layer retransmission to recover quickly from localized bit errors and credit-based flow control to meter traffic before it saturates the fabric. By leveraging Ethernet’s vast ecosystem and its rapid bandwidth scaling, with speeds doubling roughly every two years toward 1.6 terabits per second, Nokia and the UEC aim to provide a highly flexible, vendor-neutral alternative to proprietary interconnects, supporting everything from local scale-out clusters to geo-distributed scale-across environments.
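As a rough illustration of the credit-based flow control idea mentioned above, the following Python sketch models a sender that may transmit only while it holds credits, and a receiver that returns credits as its buffer drains. The class names, the buffer size, and the one-credit-per-packet accounting are assumptions made for this example; they do not reproduce the mechanism as defined in UEC 1.0.

```python
from collections import deque

class CreditedReceiver:
    """Receiver with a fixed number of buffer slots (hypothetical model)."""
    def __init__(self, buffer_slots):
        self.buffer_slots = buffer_slots
        self.buffer = deque()

    def accept(self, packet):
        if len(self.buffer) >= self.buffer_slots:
            raise OverflowError("sender transmitted without a credit")
        self.buffer.append(packet)

    def drain(self, count):
        """Consume up to `count` packets and return that many credits."""
        freed = min(count, len(self.buffer))
        for _ in range(freed):
            self.buffer.popleft()
        return freed

class CreditedSender:
    """Sender that spends one credit per packet (assumed accounting)."""
    def __init__(self, initial_credits):
        self.credits = initial_credits
        self.sent = 0

    def try_send(self, receiver, packet):
        if self.credits == 0:
            return False          # held back at the source, nothing is dropped
        self.credits -= 1
        receiver.accept(packet)
        self.sent += 1
        return True

    def add_credits(self, count):
        self.credits += count

rx = CreditedReceiver(buffer_slots=4)
tx = CreditedSender(initial_credits=4)

# Six packets are offered, but only four credits are available.
first_burst = [tx.try_send(rx, f"pkt{i}") for i in range(6)]

# The receiver drains two packets and returns two credits.
tx.add_credits(rx.drain(2))
second_burst = [tx.try_send(rx, f"pkt{i}") for i in range(4, 6)]

print(first_burst, second_burst, len(rx.buffer))
```

The point of the sketch is that the two excess packets wait at the sender instead of overflowing the receiver's buffer, which is how credits meter traffic before congestion builds up.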
Personnel: Alfred Nothaft