This video is part of the appearance, “Juniper Networks Presents at AI Infrastructure Field Day 2”. It was recorded as part of AI Infrastructure Field Day 2 from 08:00 to 11:30 on April 23, 2025.
Watch on YouTube
Watch on Vimeo
Vikram Singh, Sr. Product Manager, AI Data Center Solutions at Juniper Networks, discussed maximizing AI cluster performance using Juniper’s self-optimizing Ethernet fabric. As AI workloads scale, high GPU utilization and minimized congestion are critical to maximizing performance and ROI. Juniper’s advanced load balancing innovations deliver a self-optimizing Ethernet fabric that dynamically adapts to congestion and keeps AI clusters running at peak efficiency.
The presentation addressed the unique challenges posed by AI/ML traffic, which is primarily UDP-based (RoCEv2) with low entropy and bursty, high-bandwidth flows, combined with the synchronous nature of data parallelism, where GPUs must synchronize gradients after each iteration. This synchronization makes job completion time the key metric, since a delay on a single flow can idle many GPUs. Traditional Ethernet load balancing, designed around large numbers of high-entropy TCP flows with in-order delivery requirements, handles this traffic poorly, leading to congestion and performance degradation. Alternatives such as packet spraying with specialized NICs or distributed scheduled fabrics address the problem but are expensive and proprietary.
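To make the ECMP problem concrete, the short Python sketch below (not from the presentation, and not Juniper code) hashes a handful of low-entropy RoCEv2-style flows onto a fixed set of uplinks; the addresses, uplink count, and hash function are illustrative assumptions. Because every packet of a flow follows the same hashed path, a few large flows can pile onto one link while others sit idle, and the synchronized job cannot finish until that congested link drains.

```python
# Illustrative sketch: why low-entropy RoCEv2 traffic defeats static ECMP.
import hashlib

NUM_UPLINKS = 8  # assumed fabric width for the example

def ecmp_link(src_ip, dst_ip, src_port, dst_port=4791, proto=17):
    """Static ECMP: hash the 5-tuple once; every packet of the flow follows it."""
    key = f"{src_ip}:{dst_ip}:{src_port}:{dst_port}:{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_UPLINKS

# Eight synchronized gradient flows (one per GPU pair), all RoCEv2 (UDP/4791)
# with the same source port, i.e. very little entropy for the hash to spread.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152) for i in range(8)]

load = [0] * NUM_UPLINKS
for src, dst, sport in flows:
    load[ecmp_link(src, dst, sport)] += 1

print("flows per uplink:", load)
# With only a handful of low-entropy flows, collisions are likely: some uplinks
# carry two or more elephant flows while others carry none. The all-reduce
# cannot finish until the shared uplink drains, so every GPU in the job waits.
```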
Juniper offers an open, standards-based Ethernet approach called AI load balancing. It includes dynamic load balancing (DLB), which enhances static ECMP by tracking link utilization and buffer pressure at microsecond granularity to make better-informed forwarding decisions. DLB operates in flowlet mode (breaking flows into subflows at configurable inactivity gaps, as sketched below) or packet mode (per-packet spraying). Global load balancing (GLB) extends DLB by exchanging link-quality information between leaves and spines, so leaves can steer traffic away from congested downstream paths. Finally, Juniper’s RDMA-aware load balancing (RLB) uses deterministic routing, assigning IP addresses to subflows to eliminate randomness and deliver consistently high performance and in-order delivery, even on non-rail-optimized designs, without expensive hardware.
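The flowlet idea at the heart of DLB can be sketched in a few lines of Python. This is a simplified illustration under assumed names and thresholds (LinkState, FlowletTable, INACTIVITY_INTERVAL_US, and the quality formula are all invented for the example), not Juniper’s implementation: real switches measure link utilization and buffer pressure in hardware. The logic is similar in spirit, though: a flow keeps its current path while packets are flowing, and is re-steered to the best available link only after an inactivity gap long enough that in-flight packets have drained, so no reordering occurs.

```python
# Simplified sketch of flowlet-mode dynamic load balancing (illustrative only).
from dataclasses import dataclass, field

INACTIVITY_INTERVAL_US = 16  # assumed flowlet gap threshold

@dataclass
class LinkState:
    utilization: float = 0.0   # moving average of link load
    queue_depth: int = 0       # buffer pressure (packets queued)

    def quality(self) -> float:
        # Lower is better: combine utilization and buffer pressure (toy weighting).
        return self.utilization + 0.01 * self.queue_depth

@dataclass
class FlowletTable:
    links: list
    last_seen_us: dict = field(default_factory=dict)   # flow -> last packet time
    assigned_link: dict = field(default_factory=dict)  # flow -> current link index

    def forward(self, flow_key, now_us):
        idle = now_us - self.last_seen_us.get(flow_key, -10**9)
        if flow_key not in self.assigned_link or idle > INACTIVITY_INTERVAL_US:
            # New flowlet: pick the least congested link right now.
            best = min(range(len(self.links)), key=lambda i: self.links[i].quality())
            self.assigned_link[flow_key] = best
        self.last_seen_us[flow_key] = now_us
        return self.assigned_link[flow_key]

# Usage: two uplinks; congestion shifts, and the flow migrates only at a flowlet boundary.
links = [LinkState(utilization=0.9, queue_depth=400), LinkState(utilization=0.2)]
table = FlowletTable(links)
print(table.forward("gpu3->gpu7", now_us=0))          # picks link 1 (less loaded)
links[0].utilization, links[0].queue_depth = 0.1, 0   # link 0 drains
links[1].utilization = 0.95                           # link 1 congests
print(table.forward("gpu3->gpu7", now_us=5))          # gap < threshold: stays on link 1
print(table.forward("gpu3->gpu7", now_us=100))        # gap > threshold: new flowlet moves to link 0
```

GLB would feed the same decision with link-quality data learned from the spines rather than only locally measured state, and packet mode removes the flowlet boundary entirely by spraying every packet, at the cost of requiring reordering tolerance downstream.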
Personnel: Vikram Singh