Artificial intelligence and machine learning are already part of many industries and of day-to-day life, and their use will keep expanding. This session shows how Ethernet networks that use RoCEv2 transport benefit AI/ML clusters. You'll also see a demonstration of the congestion management capabilities of Nexus switches, which improve transport for AI workloads.
Nemanja Kamenica, a technical marketing engineer at Cisco, presented an AI/ML data center networking blueprint. The presentation highlighted applications of AI across sectors such as medical research, financial services, public transport optimization, manufacturing, and retail recommendations. Kamenica discussed the two types of AI clusters: distributed training clusters and production inference clusters. He outlined the network requirements for AI training networks, including non-blocking transport, lossless Ethernet, and RoCEv2 with PFC and ECN for congestion management. A demo showed what happens when multiple hosts send data to a storage device at the same time: the traffic converges on a single switch port and congests it. To address this, the Nexus Dashboard Fabric Controller allows operators to configure QoS mechanisms such as WRED, ECN, and PFC. Properly managing the bursty all-to-all communication within AI clusters is crucial to preventing congestion and limiting the financial losses it can cause.
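As a rough illustration of how WRED with ECN reacts to a filling queue, the Python sketch below models the usual textbook behavior: no marking below a minimum threshold, a linearly increasing marking probability between the minimum and maximum thresholds, and marking of every packet above the maximum. The function name, the parameter names, and the threshold values are hypothetical and chosen only for illustration; they are not taken from the presentation or from Nexus defaults.

```python
def ecn_mark_probability(queue_depth_kb, min_threshold_kb, max_threshold_kb, max_probability):
    """Return the probability that a packet is ECN-marked at a given queue depth.

    Simplified WRED/ECN model (an assumption, not a vendor implementation):
    - at or below the minimum threshold, nothing is marked
    - between the thresholds, probability rises linearly to max_probability
    - at or above the maximum threshold, every packet is marked
    """
    if min_threshold_kb >= max_threshold_kb:
        raise ValueError("min_threshold_kb must be smaller than max_threshold_kb")
    if not 0.0 <= max_probability <= 1.0:
        raise ValueError("max_probability must be between 0 and 1")

    if queue_depth_kb <= min_threshold_kb:
        return 0.0
    if queue_depth_kb >= max_threshold_kb:
        return 1.0

    span = max_threshold_kb - min_threshold_kb
    return max_probability * (queue_depth_kb - min_threshold_kb) / span


if __name__ == "__main__":
    # Hypothetical thresholds: start marking at 100 KB, mark everything at 300 KB.
    for depth in (50, 100, 200, 299, 300, 400):
        p = ecn_mark_probability(depth, 100, 300, 0.10)
        print(f"queue depth {depth} KB: mark probability {p:.3f}")
```

In this model, ECN marks signal senders to slow down before the queue overflows, while PFC would act as a backstop that pauses traffic if the queue keeps growing regardless.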
Personnel: Nemanja Kamenica