Sharon Nagar, Robin Grindley, and Hemal Shah presented for Broadcom at Cloud Field Day 19
This presentation took place on February 1, 2024, from 13:00 to 14:30.
Presenters: Hemal Shah, Robin Grindley, Sharon Nagar
Delegate Panel: Ron Westfall, Tom Hollingsworth
Broadcom Trident5-X12: Smarter Cloud Infrastructure
Network infrastructure for the cloud is undergoing a phase change. Robin Grindley, Principal PLM, presents Broadcom’s new Trident5-X12 chip, which introduces new capabilities to enhance performance and security, aided by a novel line-rate, packet-processing inference engine called NetGNT.
In this presentation, Robin Grindley from Broadcom introduces the new Trident5-X12 chip, which is designed to upgrade cloud infrastructure by enhancing performance and security. The chip features a line-rate packet-processing inference engine called NetGNT, which allows for real-time analysis of network traffic to identify patterns and potential security threats without software intervention.
The Trident5-X12 chip offers various capabilities, including support for 800 gigabit Ethernet ports, cognitive routing, and programmable packet processing pipelines using Broadcom’s Network Programming Language (NPL). The NetGNT engine is a key innovation that can be trained to recognize specific traffic patterns, such as those associated with denial-of-service attacks, and take appropriate actions at line rate.
Grindley emphasizes that the cloud is evolving with new demands, particularly due to AI and ML workloads, which require advanced networking solutions capable of handling high bandwidth and providing security at scale. The Trident5-X12 chip is positioned to address these needs by offering powerful, programmable hardware that operates at the speeds required by modern cloud infrastructures.
Personnel: Robin Grindley
Broadcom Qumran3D: The Industry’s First 5nm 25.6T Router
Sharon Nagar, Principal PLM, reviews the latest Broadcom innovations in the WAN space and the world’s first 5nm, 25.6T router on a single chip. The presentation covers the level of integration that went into the Qumran3D, shrinking what used to be a multi-RU chassis into a single-chip solution and significantly reducing the space and power needed to operate high-end routers.
Nagar highlights Broadcom’s work in switching and routing, noting the three main product lines: Trident, Tomahawk, and Jericho, each optimized for different market segments and use cases. These devices share common infrastructure, including connectivity, enhanced telemetry features, and the software core known as the SDK.
The Qumran 3D is a 25.6 terabit, full carrier-grade router with deep buffers, large routing databases capable of holding the entire internet routing table, and an encryption engine that covers all of its throughput. It represents a significant advancement in integration, shrinking what used to be a multi-RU (rack unit) chassis into a single-chip solution, which reduces space and power requirements for high-end routers.
Nagar explains that the Qumran 3D is part of the Qumran product line, which has been around for ten years and runs parallel to the Jericho series. The Qumran 3D offers a 100-fold speed increase and a 95% power efficiency improvement over the technology from ten years ago. The device has 256 100G PAM4 SerDes, allowing for various port speeds and configurations, and integrates advanced features such as hierarchical traffic management, encryption, and a large number of access control lists (ACLs) and counters.
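The SerDes count above ties directly to the headline bandwidth: 256 lanes at 100G each give 25.6 Tb/s. The following is a minimal sketch of that arithmetic, not a Broadcom tool. The function names and the list of port speeds (100G to 800G) are assumptions chosen for illustration; the summary does not state which port configurations the device actually supports.

```python
# Illustrative arithmetic only: lane count and lane speed come from the
# summary (256 x 100G PAM4 SerDes). The port speeds below are assumed
# examples, not a statement of supported configurations.

SERDES_LANES = 256   # number of SerDes lanes on the device
LANE_GBPS = 100      # speed of each lane in Gb/s


def total_tbps(lanes: int = SERDES_LANES, lane_gbps: int = LANE_GBPS) -> float:
    """Aggregate bandwidth in Tb/s when every lane is in use."""
    return lanes * lane_gbps / 1000


def port_count(port_gbps: int, lanes: int = SERDES_LANES,
               lane_gbps: int = LANE_GBPS) -> int:
    """Number of ports of a given speed that the lanes can be grouped into."""
    if port_gbps % lane_gbps != 0:
        raise ValueError("port speed must be a multiple of the lane speed")
    lanes_per_port = port_gbps // lane_gbps
    return lanes // lanes_per_port


if __name__ == "__main__":
    print(f"Aggregate bandwidth: {total_tbps()} Tb/s")
    for speed in (100, 200, 400, 800):  # hypothetical port speeds
        print(f"{speed}G ports: {port_count(speed)}")
```

Grouping lanes this way shows why a single device can be presented as many lower-speed ports or fewer higher-speed ports without changing the total throughput.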
The Qumran 3D is designed to be flexible and modular, supporting virtualization and multi-tenancy for cloud environments. It comes with tens of thousands of virtual routes and thousands of tunnels, allowing for a large number of services and customers. The routing table has ample capacity to accommodate internet routing table growth until 2030, and the device has over one million ACL rules for policies and security. The encryption engine provides the option to encrypt all traffic without limitations, and the counters on the device help monitor and manage traffic flow and subscriber usage.
The device’s SerDes exceed industry standards, allowing for various optical and copper connection solutions, including linear drive optics, coherent optics, co-packaged optics, and extended reach copper cables. This flexibility offers cost and power savings for operators.
The Qumran 3D’s capabilities make it suitable for various network applications beyond cloud settings, including service provider environments. It can replace larger, more complex systems with multiple chips, simplifying software control and reducing power and space requirements.
Personnel: Sharon Nagar
Broadcom Thor 2: High Performance Ethernet NIC for AI/ML
The large scale of AI/ML clusters requires high-performance networking solutions. In this talk, we will provide an overview of Broadcom’s high-performance Ethernet NIC for AI/ML clusters. Hemal Shah, Distinguished Engineer and Architect, will describe the RoCE and congestion control features of the NIC, a reference AI/ML cluster architecture based on Broadcom switches and NICs, and the benefits of end-to-end networking.
Shah begins with a discussion of the importance of high-performance networking for AI/ML clusters. He emphasizes that as AI/ML workloads increase in complexity and scale, networking becomes crucial for efficient job completion times. Shah provides an overview of Broadcom’s Ethernet NIC (Network Interface Card), which is designed to meet the demands of AI/ML clusters.
He explains that AI/ML clusters require networking that can handle large amounts of data and support high-speed, low-latency communication between nodes. Broadcom’s NICs and switches are designed to work together to provide end-to-end networking solutions that address these needs.
Shah outlines the key features of Broadcom’s 400G NIC, including:
– Support for RDMA over Converged Ethernet (RoCE) and congestion control, which are important for AI/ML workloads.
– The ability to handle 400G bi-directional line rates with low latency to ensure rapid data transfer.
– A PCIe Gen 5 x16 host interface to maintain high throughput.
– Advanced congestion control mechanisms that react to network congestion and optimize traffic flow.
– Security features such as a hardware root of trust to ensure only authenticated firmware runs on the NIC.
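The pairing of a 400G line rate with a PCIe Gen 5 x16 host interface in the list above can be checked with simple arithmetic. The sketch below uses the published PCIe Gen 5 figures (32 GT/s per lane, 128b/130b encoding) and deliberately ignores packet-level protocol overhead, so the result is an upper bound rather than achievable throughput. The function names are hypothetical.

```python
# Back-of-the-envelope check, not a performance model. It ignores TLP/DLLP
# protocol overhead, so real host throughput is lower than this figure.

GEN5_GT_PER_LANE = 32.0          # PCIe Gen 5 raw rate per lane, in GT/s
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding


def pcie_gen5_gbps(lanes: int = 16) -> float:
    """Per-direction bandwidth in Gb/s after line encoding."""
    return GEN5_GT_PER_LANE * ENCODING_EFFICIENCY * lanes


def headroom_gbps(nic_gbps: float = 400.0, lanes: int = 16) -> float:
    """Bandwidth left over on the host interface at a given NIC line rate."""
    return pcie_gen5_gbps(lanes) - nic_gbps


if __name__ == "__main__":
    print(f"Gen 5 x16: {pcie_gen5_gbps(16):.1f} Gb/s per direction")
    print(f"Headroom over 400G: {headroom_gbps():.1f} Gb/s")
    print(f"Gen 5 x8:  {pcie_gen5_gbps(8):.1f} Gb/s per direction")
```

Under these assumptions an x16 link leaves roughly 100 Gb/s of margin per direction over a 400G port, while an x8 link would fall short of line rate.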
Shah also discusses the reference architecture for an AI/ML cluster that incorporates Broadcom switches and NICs, designed to scale to thousands of GPUs and provide robust networking capabilities. He concludes by highlighting the importance of end-to-end fabric management for operating large-scale networks effectively, which includes automation, performance monitoring, and diagnostic capabilities.
Personnel: Hemal Shah