Michael Witte, a principal architect at World Wide Technology, discusses the fundamental shift in data center networking required to support large-scale AI and machine learning workloads. The presentation moves from the biological inspiration behind neural networks to the physical and electrical realities that necessitate high-bandwidth GPU fabrics. Witte explains that because neural networks have grown too large to fit within the memory of a single GPU, they must be distributed across thousands of nodes, creating a massive synchronization challenge. This global synchronization behaves like a batch job in which GPUs must constantly exchange data to align weights and biases, producing the 400Gbps and 800Gbps elephant flows that characterize modern AI traffic.
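The scale of that synchronization traffic can be sketched with a back-of-the-envelope calculation. A common collective for aligning gradients is ring all-reduce, whose per-node communication volume is 2·(N−1)/N times the gradient size; the model size, precision, and GPU count below are illustrative assumptions, not figures from the talk.

```python
# Back-of-the-envelope sketch (assumed numbers, not from the presentation):
# estimate per-GPU traffic for one gradient synchronization via ring
# all-reduce, whose per-node volume is 2*(N-1)/N times the gradient size.

def allreduce_bytes_per_gpu(num_params, bytes_per_param=2, num_gpus=8):
    """Bytes each GPU must send for one ring all-reduce of the gradients."""
    grad_bytes = num_params * bytes_per_param          # e.g. fp16 gradients
    return 2 * (num_gpus - 1) / num_gpus * grad_bytes

# A hypothetical 7-billion-parameter model synchronized across 8 GPUs:
traffic = allreduce_bytes_per_gpu(7e9, bytes_per_param=2, num_gpus=8)
print(f"{traffic / 1e9:.1f} GB on the wire per GPU per sync")  # 24.5 GB
```

Repeated every training step, exchanges of this size are what turn the interconnect into a sustained elephant-flow workload rather than the short, bursty traffic typical of enterprise networks.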
The technical core of the session focuses on why traditional networking fails under these conditions and how RDMA (Remote Direct Memory Access) over Converged Ethernet (RoCEv2) provides a solution by bypassing the CPU to allow direct memory-to-memory transfers. Witte highlights the necessity of non-blocking, 1:1 subscribed fabrics, since oversubscription leads to packet drops that are catastrophic for UDP-based RoCEv2 traffic. To manage congestion without dropping packets, he details Data Center Quantized Congestion Notification (DCQCN), which combines Explicit Congestion Notification (ECN) to throttle senders gradually with Priority Flow Control (PFC) as a last-resort hammer that pauses traffic when buffers reach critical levels.
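A minimal sketch of how those two mechanisms layer on top of each other, with assumed queue thresholds (real values are per-platform tuning, not figures from the talk):

```python
# Illustrative sketch of the DCQCN layering described above: ECN marking
# ramps up probabilistically between two queue thresholds to slow senders,
# while PFC pauses the link only when the buffer nears exhaustion.
# Threshold values and names here are assumptions, not vendor defaults.
import random

ECN_MIN, ECN_MAX = 100, 400   # queue depth (KB): marking probability 0 -> 100%
PFC_XOFF = 800                # queue depth (KB) that triggers a PFC pause frame

def switch_action(queue_kb, rng=random.Random(0)):
    """What the switch does to an arriving packet at this queue depth."""
    if queue_kb >= PFC_XOFF:
        return "pfc-pause"    # the hammer: halt the upstream sender entirely
    if queue_kb >= ECN_MAX:
        return "ecn-mark"     # mark every packet
    if queue_kb >= ECN_MIN:
        p = (queue_kb - ECN_MIN) / (ECN_MAX - ECN_MIN)
        return "ecn-mark" if rng.random() < p else "forward"
    return "forward"

def sender_on_cnp(rate_gbps, alpha=0.5):
    """DCQCN-style multiplicative rate cut when a congestion notification
    packet (CNP) arrives back at the sender."""
    return rate_gbps * (1 - alpha / 2)
```

The design intent is that ECN does almost all of the work, keeping flows lossless through gradual rate reduction, so that PFC fires only as a safety net.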
Looking toward the future, the summary addresses the evolution of these fabrics into scheduled fabrics and the upcoming standards from the Ultra Ethernet Consortium (UEC). Witte explains that scheduled fabrics improve efficiency by breaking massive elephant flows into smaller flowlets and spraying them across all available paths, though this requires advanced NICs or DPUs to reassemble packets in the correct order. He concludes by emphasizing that as GPU capabilities continue to outpace traditional networking, the industry is moving toward more deterministic, flow-aware scheduling and co-packaged optics to minimize latency and maximize tokens per watt in the race for AI compute efficiency.
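The spray-and-reassemble idea can be sketched in a few lines; this illustrates the concept, not the UEC wire format, and the function names are hypothetical:

```python
# Conceptual sketch of packet spraying: a sender distributes one flow's
# packets across all equal-cost paths, and the receiving NIC/DPU restores
# sequence order, since packets on different paths may arrive out of order.

def spray(packets, num_paths):
    """Spray a flow's (seq, payload) packets across paths, round-robin."""
    lanes = [[] for _ in range(num_paths)]
    for seq, payload in packets:
        lanes[seq % num_paths].append((seq, payload))
    return lanes

def reassemble(lanes):
    """NIC/DPU side: merge per-path arrivals back into sequence order."""
    arrivals = [pkt for lane in lanes for pkt in lane]
    return [payload for _, payload in sorted(arrivals)]

flow = [(i, f"pkt{i}") for i in range(8)]
lanes = spray(flow, num_paths=4)            # one elephant flow -> 4 lanes
assert reassemble(lanes) == [f"pkt{i}" for i in range(8)]
```

Spreading the load this way keeps every link evenly utilized instead of letting one hashed elephant flow saturate a single path, at the cost of the reordering logic the text places in the NIC or DPU.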
Personnel: Michael Witte