Michael Witte, a principal architect at World Wide Technology, discusses the fundamental shift in data center networking required to support large-scale AI and machine learning workloads. The presentation moves from the biological inspiration behind neural networks to the physical and electrical realities that necessitate high-bandwidth GPU fabrics. Witte explains that because neural networks have grown too large to fit within the memory of a single GPU, they must be distributed across thousands of nodes, creating a massive synchronization challenge. This global synchronization behaves like a batch job in which GPUs must constantly exchange data to align weights and biases, producing the 400Gbps and 800Gbps elephant flows that characterize modern AI traffic.
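The scale of that synchronization traffic can be sketched with a back-of-the-envelope calculation. A common collective for aligning gradients is ring all-reduce, whose per-node communication volume is 2·(N−1)/N times the gradient size; the model size, precision, and GPU count below are illustrative assumptions, not figures from the talk.

```python
# Back-of-the-envelope sketch (assumed numbers, not from the presentation):
# estimate per-GPU traffic for one gradient synchronization via ring
# all-reduce, whose per-node volume is 2*(N-1)/N times the gradient size.

def allreduce_bytes_per_gpu(num_params, bytes_per_param=2, num_gpus=8):
    """Bytes each GPU must send for one ring all-reduce of the gradients."""
    grad_bytes = num_params * bytes_per_param          # e.g. fp16 gradients
    return 2 * (num_gpus - 1) / num_gpus * grad_bytes

# A hypothetical 7-billion-parameter model synchronized across 8 GPUs:
traffic = allreduce_bytes_per_gpu(7e9, bytes_per_param=2, num_gpus=8)
print(f"{traffic / 1e9:.1f} GB on the wire per GPU per sync")  # 24.5 GB
```

Repeated every training step, exchanges of this size are what turn the interconnect into a sustained elephant-flow workload rather than the short, bursty traffic typical of enterprise networks.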
The technical core of the session focuses on why traditional networking fails under these conditions and how RDMA (Remote Direct Memory Access) over Converged Ethernet (RoCEv2) provides a solution by bypassing the CPU to allow direct memory-to-memory transfers. Witte highlights the necessity of non-blocking, 1:1 subscribed fabrics, since oversubscription leads to packet drops that are catastrophic for UDP-based RoCEv2 traffic. To manage congestion without dropping packets, he details Data Center Quantized Congestion Notification (DCQCN), which combines Explicit Congestion Notification (ECN) to throttle senders gradually with Priority Flow Control (PFC) as a last-resort hammer that pauses traffic when buffers reach critical levels.
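A minimal sketch of how those two mechanisms layer on top of each other, with assumed queue thresholds (real values are per-platform tuning, not figures from the talk):

```python
# Illustrative sketch of the DCQCN layering described above: ECN marking
# ramps up probabilistically between two queue thresholds to slow senders,
# while PFC pauses the link only when the buffer nears exhaustion.
# Threshold values and names here are assumptions, not vendor defaults.
import random

ECN_MIN, ECN_MAX = 100, 400   # queue depth (KB): marking probability 0 -> 100%
PFC_XOFF = 800                # queue depth (KB) that triggers a PFC pause frame

def switch_action(queue_kb, rng=random.Random(0)):
    """What the switch does to an arriving packet at this queue depth."""
    if queue_kb >= PFC_XOFF:
        return "pfc-pause"    # the hammer: halt the upstream sender entirely
    if queue_kb >= ECN_MAX:
        return "ecn-mark"     # mark every packet
    if queue_kb >= ECN_MIN:
        p = (queue_kb - ECN_MIN) / (ECN_MAX - ECN_MIN)
        return "ecn-mark" if rng.random() < p else "forward"
    return "forward"

def sender_on_cnp(rate_gbps, alpha=0.5):
    """DCQCN-style multiplicative rate cut when a congestion notification
    packet (CNP) arrives back at the sender."""
    return rate_gbps * (1 - alpha / 2)
```

The design intent is that ECN does almost all of the work, keeping flows lossless through gradual rate reduction, so that PFC fires only as a safety net.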
Looking toward the future, the summary addresses the evolution of these fabrics into scheduled fabrics and the upcoming standards from the Ultra Ethernet Consortium (UEC). Witte explains that scheduled fabrics improve efficiency by breaking massive elephant flows into smaller flowlets and spraying them across all available paths, though this requires advanced NICs or DPUs to reassemble packets in the correct order. He concludes by emphasizing that as GPU capabilities continue to outpace traditional networking, the industry is moving toward more deterministic, flow-aware scheduling and co-packaged optics to minimize latency and maximize tokens per watt in the race for AI compute efficiency.
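The spray-and-reassemble idea can be sketched in a few lines; this illustrates the concept, not the UEC wire format, and the function names are hypothetical:

```python
# Conceptual sketch of packet spraying: a sender distributes one flow's
# packets across all equal-cost paths, and the receiving NIC/DPU restores
# sequence order, since packets on different paths may arrive out of order.

def spray(packets, num_paths):
    """Spray a flow's (seq, payload) packets across paths, round-robin."""
    lanes = [[] for _ in range(num_paths)]
    for seq, payload in packets:
        lanes[seq % num_paths].append((seq, payload))
    return lanes

def reassemble(lanes):
    """NIC/DPU side: merge per-path arrivals back into sequence order."""
    arrivals = [pkt for lane in lanes for pkt in lane]
    return [payload for _, payload in sorted(arrivals)]

flow = [(i, f"pkt{i}") for i in range(8)]
lanes = spray(flow, num_paths=4)            # one elephant flow -> 4 lanes
assert reassemble(lanes) == [f"pkt{i}" for i in range(8)]
```

Spreading the load this way keeps every link evenly utilized instead of letting one hashed elephant flow saturate a single path, at the cost of the reordering logic the text places in the NIC or DPU.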
Personnel: Michael Witte