This video is part of the appearance “Cisco Enterprise Networking Presents at Tech Field Day Extra at Cisco Live US 2023”. It was recorded as part of Tech Field Day Extra at Cisco Live US 2023 from 13:00 to 14:30 on June 6, 2023.
Artificial intelligence and machine learning are part of many industries and of day-to-day life, and their use will keep expanding. This session shows how Ethernet networks that use RoCEv2 transport benefit AI/ML clusters. You’ll also see a demonstration of the congestion management capabilities of Nexus switches, which improve the transport of AI workloads.
Nemanja Kamenica, a technical marketing engineer at Cisco, presented an AI/ML data center networking blueprint. The presentation highlighted the diverse applications of AI in sectors such as medical research, financial services, public transport optimization, manufacturing, and retail recommendations. Kamenica discussed the two types of AI clusters: distributed training clusters and production inference clusters. He outlined the network requirements for AI training networks: non-blocking transport, lossless Ethernet, and RoCEv2 (RDMA over Converged Ethernet version 2) with Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) for congestion management. A demo showed what happens when multiple hosts send data to a storage device at the same time: the switch port facing that device becomes congested. To address this, the Nexus Dashboard Fabric Controller allows the configuration of QoS mechanisms such as WRED, ECN, and PFC. Proper management of the bursty all-to-all communication within AI clusters is crucial to prevent congestion and the financial losses it can cause.
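As a rough illustration of how WRED and ECN work together, the Python sketch below computes the probability that an ECN-capable packet is marked as a queue fills. It is not taken from the presentation or from Nexus software: the function name, the thresholds, and the maximum probability are hypothetical values chosen for readability, and the model assumes the textbook WRED curve (no marking below a minimum threshold, a linear ramp up to a maximum threshold, and marking of every packet beyond it). PFC, which pauses the sender per priority class, is not modeled here.

```python
def wred_ecn_mark_probability(queue_depth_kb, min_threshold_kb,
                              max_threshold_kb, max_probability):
    """Return the probability of ECN-marking an arriving packet.

    Assumed model (textbook WRED with ECN, not a vendor implementation):
      - at or below min_threshold_kb: never mark
      - between the thresholds: probability ramps linearly to max_probability
      - at or above max_threshold_kb: mark every packet
    """
    if min_threshold_kb >= max_threshold_kb:
        raise ValueError("min_threshold_kb must be below max_threshold_kb")
    if not 0.0 <= max_probability <= 1.0:
        raise ValueError("max_probability must be between 0 and 1")

    if queue_depth_kb <= min_threshold_kb:
        return 0.0
    if queue_depth_kb >= max_threshold_kb:
        return 1.0

    # Fraction of the way from the minimum to the maximum threshold.
    fraction = (queue_depth_kb - min_threshold_kb) / (
        max_threshold_kb - min_threshold_kb
    )
    return fraction * max_probability


if __name__ == "__main__":
    # Hypothetical thresholds: start marking at 100 KB, mark everything at 300 KB.
    for depth_kb in (50, 100, 200, 300, 400):
        p = wred_ecn_mark_probability(depth_kb, 100, 300, 0.5)
        print(f"queue depth {depth_kb} KB: mark probability {p}")
```

In an incast like the one in the demo, the queue on the storage-facing port grows as several senders converge on it. Under this model, marking starts early and becomes more frequent as the queue deepens, which signals RoCEv2 endpoints to slow down before the buffer is exhausted.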
Personnel: Nemanja Kamenica