Broadcom Presented at Networking Field Day 32
This presentation took place on July 26, 2023, from 8:00 to 10:00.
Presenters: Henry Wu, Mohan Kalkunte, Pete Del Vecchio
Follow on Twitter using the following hashtags or usernames: #NFD32
Broadcom Ethernet Fabric for AI and ML at Scale
In this discussion, Mohan Kalkunte, VP of Architecture and Technology at Broadcom, explains why the Ethernet fabric matters for AI and introduces Broadcom’s solutions for AI workloads. AI applications have demanding requirements, including large models with billions of parameters. GPUs do the processing, and the network interconnects and coordinates large GPU clusters. AI training cycles through compute, communication, and synchronization phases: large neural networks are trained, then gradients and parameters are exchanged. AI traffic is distinctive: there are fewer flows, each carrying high bandwidth, and the traffic is synchronized and bursty, which makes flow collisions and link failures especially costly. Tail latency strongly affects training performance because a synchronized step finishes only when its slowest transfer does, so reducing tail latency shortens job completion time. Techniques used to improve AI networking include network telemetry, packet spraying, load-aware ECMP, zero-impact failover, and credit control mechanisms.
Personnel: Mohan Kalkunte
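The effect of flow collisions on a synchronized training step can be illustrated with a toy model. The sketch below is not Broadcom's implementation: the function names, the CRC32 hash, the flow names, and the 400 Gb/s link rate are all assumptions chosen for illustration. It compares static ECMP, where each flow is pinned to one link by a hash, with an idealized form of packet spraying, where every flow is spread evenly over all links.

```python
import zlib


def ecmp_link_bytes(flows, num_links):
    """Static ECMP: pin each flow to one link chosen by a hash of its ID."""
    link_bytes = [0] * num_links
    for flow_id, size in flows.items():
        # CRC32 stands in for a switch hash function (an assumption).
        link = zlib.crc32(flow_id.encode()) % num_links
        link_bytes[link] += size
    return link_bytes


def sprayed_link_bytes(flows, num_links):
    """Idealized packet spraying: every flow's bytes are split evenly."""
    total = sum(flows.values())
    return [total / num_links] * num_links


def step_time(link_bytes, link_bytes_per_sec):
    """A synchronized step ends only when the most loaded link has drained."""
    return max(link_bytes) / link_bytes_per_sec


# Hypothetical workload: 8 large flows of 1 GB each over 8 equal links.
flows = {f"gpu{i}->gpu{i + 8}": 10**9 for i in range(8)}
rate = 50e9  # 400 Gb/s expressed in bytes per second

print("ECMP step time (s):   ", step_time(ecmp_link_bytes(flows, 8), rate))
print("Sprayed step time (s):", step_time(sprayed_link_bytes(flows, 8), rate))
```

With only a handful of large flows, a hash can easily place two or more on the same link while other links sit idle, and the step time is set by that most loaded link. Spraying cannot do worse than ECMP in this model, since the maximum link load is never below the mean.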
Broadcom Jericho3 AI Ethernet Fabric
Broadcom offers scheduled fabric solutions built on a switch-based architecture using the Jericho and Ramon families. This architecture provides congestion-free operation and perfect load balancing for AI workloads. The leaf devices handle switch functionality, lookup, queuing, and scheduling, while the spine devices interconnect the leaf devices and are optimized for AI workloads. The fabric segments traffic into cells and reassembles it at the destination, and hardware-based mechanisms keep network failures from affecting traffic. The architecture scales from a few hundred boxes to tens of thousands, with the latest-generation devices scaling from 18K to 32K nodes and from 400G to 800G Ethernet. It provides multi-tenancy and job isolation for AI clusters, along with hardware-based telemetry and diagnostic tools for visibility into network performance. The design is intended to adapt as AI models and workloads evolve.
Personnel: Henry Wu
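The segmentation and reassembly described above can be sketched in a few lines. This is a simplified model under stated assumptions, not the actual Jericho or Ramon cell format: the `Cell` type, the 256-byte cell size, the sequence-number field, and the round-robin link choice are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    seq: int        # position of this cell within the original packet
    payload: bytes  # slice of the packet carried by this cell


def segment(packet: bytes, cell_size: int = 256) -> list:
    """Split a packet into fixed-size cells (the last one may be shorter)."""
    return [
        Cell(seq, packet[offset:offset + cell_size])
        for seq, offset in enumerate(range(0, len(packet), cell_size))
    ]


def spray(cells, num_links: int) -> list:
    """Distribute cells round-robin so every fabric link carries an equal share."""
    links = [[] for _ in range(num_links)]
    for cell in cells:
        links[cell.seq % num_links].append(cell)
    return links


def reassemble(links) -> bytes:
    """Restore the packet at the egress leaf, whatever order links deliver in."""
    cells = sorted((c for link in links for c in link), key=lambda c: c.seq)
    return b"".join(c.payload for c in cells)


packet = bytes(range(256)) * 10           # a 2560-byte example packet
links = spray(segment(packet), num_links=4)
print([len(link) for link in links])      # cells carried per link
print(reassemble(links[::-1]) == packet)  # link order does not matter
```

Because the cells are small and uniform, the load on each link differs by at most one cell regardless of how large the original flows are, which is the intuition behind load balancing at cell granularity rather than per flow.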
Broadcom AI Interconnect and Tomahawk AI Fabrics
Broadcom’s Tomahawk solution is a game-changer in the AI interconnect realm, offering tailored solutions to hyperscalers with diverse needs. The Tomahawk Fabric provides high bandwidth and high radix, catering to flatter network topologies. To optimize network performance, load balancing, traffic spraying, and advanced telemetry are implemented. Fast link failover ensures uninterrupted traffic flow during link failures, enhancing network reliability. The solution supports various congestion management mechanisms, with flexibility for customization. With Tomahawk 5’s native support for 800 GigE ports, the solution meets the rising demand for higher bandwidth in modern AI architectures. Broadcom’s Tomahawk solution sets new standards in efficient and reliable AI interconnectivity, adapting to the evolving industry requirements.
Personnel: Pete Del Vecchio
Broadcom Ethernet Fabric for AI ML at Scale Wrap Up
Broadcom’s silicon solutions and network operating systems drive an ecosystem that promotes Ethernet as a cost-effective and high-performing alternative to InfiniBand. Broadcom offers advanced capabilities through silicon such as Tomahawk 5 and Jericho3 AI, which can be paired with a range of network operating systems. The presentation cites deployments at AWS and Oracle Cloud Infrastructure, where Ethernet is said to have outperformed InfiniBand and delivered significant cost savings. As a founding member of the Ultra Ethernet Consortium, Broadcom contributes to the development of new Ethernet standards aimed at large-scale cloud data centers. These deployments are presented as evidence that Ethernet works well across diverse applications and is a compelling technology choice.
Personnel: Mohan Kalkunte