Paresh Gupta, Siva Sivakumar, Nicholas Davidson, and Tushar Patel presented for Cisco at AI Field Day 5
Presentation date: September 11, 2024, 14:00-16:00.
Presenters: Jake Katz, Nicholas Davidson, Paresh Gupta, Siva Sivakumar, Tushar Patel
Navigating the AI Landscape: Insights, Innovations, and Infrastructure Advancements with Cisco
Whether you’re an AI enthusiast, data center manager, or technology strategist, this session offers insights and practical knowledge to help you navigate the evolving AI landscape. Join us for an overview of the AI market and the shift from InfiniBand to Ethernet in AI data centers. The session covers lessons from hyperscaler implementations and the evolving continuum of customer needs, from a la carte and build-your-own systems to turnkey solutions. Discover how Cisco is advancing AI infrastructure with innovations like the Cisco Nexus Hyperfabric AI, developed in collaboration with NVIDIA, and learn how these advancements are making AI more accessible and scalable for businesses of all sizes.
Jake Katz, Vice President of AI/ML Product Management at Cisco, provided a comprehensive overview of the current AI landscape, emphasizing the transition from InfiniBand to Ethernet in AI data centers. He highlighted the significant role of hyperscalers in driving AI innovations, particularly in the development of large language models and GPU clusters. Katz noted that while hyperscalers are at the forefront of AI advancements, there remains a vast potential for enterprise adoption, which is still in its early stages. He discussed the increasing bandwidth demands driven by AI workloads, predicting a shift towards 800 gigabit data centers in the near future, and underscored the importance of power and cooling solutions as AI technologies evolve.
Katz introduced Cisco’s Nexus Hyperfabric, a cloud-based management system designed to simplify the deployment and management of AI clusters. This solution, developed in partnership with NVIDIA, aims to provide a plug-and-play experience for enterprises looking to harness AI capabilities without the complexity typically associated with such deployments. The Hyperfabric solution integrates high-performance Ethernet with a full hardware and software stack, allowing customers to manage their AI infrastructure efficiently. Katz emphasized that Cisco’s approach is tailored to meet the diverse needs of customers across the AI continuum, from hyperscalers to Fortune 5,000 enterprises, ensuring that organizations can effectively navigate their AI journeys with the right tools and infrastructure in place.
Personnel: Jake Katz
Demystifying Artificial Intelligence and Machine Learning Infrastructure for a Network Engineer with Cisco
Cisco’s presentation at AI Field Day 5, led by Paresh Gupta and Nicholas Davidson, focused on demystifying AI/ML infrastructure for network engineers, particularly in the context of building and managing GPU clusters for AI workloads. Paresh, a technical marketing leader, began by explaining the challenges of setting up a GPU cluster, emphasizing the importance of inter-GPU networking and how Cisco’s Nexus 9000 Series switches address these challenges. He highlighted the complexity of cabling and configuring such clusters: setup can take weeks, but with Cisco’s validated solutions the process can be streamlined to just eight hours. Paresh also discussed the importance of non-blocking, non-oversubscribed network designs, such as the “Rail-Optimized” design used by NVIDIA and the “Fly” design by Intel, which ensure efficient communication between GPUs during distributed AI training tasks.
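The "non-oversubscribed" property mentioned above comes down to simple arithmetic on a leaf switch: the bandwidth facing the GPUs should not exceed the bandwidth facing the rest of the fabric. The following is a minimal sketch of that check. The function names, port counts, and link speeds are hypothetical values chosen for illustration and are not taken from the presentation.

```python
from fractions import Fraction


def oversubscription_ratio(downlinks: int, downlink_gbps: int,
                           uplinks: int, uplink_gbps: int) -> Fraction:
    """Return host-facing bandwidth divided by fabric-facing bandwidth.

    A ratio of 1 or less means the leaf can forward all host traffic
    upstream at line rate; a ratio above 1 means it is oversubscribed.
    """
    if min(downlinks, downlink_gbps, uplinks, uplink_gbps) <= 0:
        raise ValueError("port counts and speeds must be positive")
    return Fraction(downlinks * downlink_gbps, uplinks * uplink_gbps)


def is_non_oversubscribed(ratio: Fraction) -> bool:
    """True when uplink capacity covers the full downlink capacity."""
    return ratio <= 1


# Hypothetical leaf A: 32 GPU-facing ports and 32 spine-facing ports,
# all at 400 Gbps, so downlink and uplink capacity match.
balanced = oversubscription_ratio(32, 400, 32, 400)

# Hypothetical leaf B: 48 host ports at 100 Gbps but only 6 uplinks
# at 400 Gbps, so hosts can offer twice what the uplinks can carry.
oversubscribed = oversubscription_ratio(48, 100, 6, 400)

print(balanced, is_non_oversubscribed(balanced))
print(oversubscribed, is_non_oversubscribed(oversubscribed))
```

This check covers only capacity on a single switch. Whether traffic is actually spread evenly across the uplinks is a separate question that depends on the load-balancing scheme in use.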
The presentation also delved into the technical aspects of inter-GPU communication, particularly the need for collective communication operations like all-reduce and reduce-scatter, which allow GPUs to synchronize their states during parallel processing. Paresh explained how Cisco’s network designs, such as the use of dynamic load balancing and static pinning, help optimize the flow of data between GPUs, reducing congestion and improving performance. He also touched on the importance of creating a lossless network using priority-based flow control to avoid packet loss, which can significantly delay AI training jobs. Cisco’s Nexus Dashboard plays a crucial role in monitoring and detecting anomalies, such as packet loss or congestion, ensuring that the network operates efficiently.
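To make the collectives named above concrete, here is a minimal pure-Python simulation of a ring all-reduce, built as a reduce-scatter phase followed by an all-gather phase. It is an illustrative sketch of the general technique, not the implementation used in the presentation or in any particular collective library. The function name and the sample data are hypothetical, and lists stand in for GPU buffers.

```python
def ring_all_reduce(buffers):
    """Sum equal-length buffers across ranks arranged in a ring.

    Each rank only ever sends one chunk to its right-hand neighbour
    per step, which is why the pattern is sensitive to per-link
    congestion and packet loss.
    """
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers):
        raise ValueError("all ranks must hold buffers of equal length")
    if length % n != 0:
        raise ValueError("buffer length must be divisible by rank count")
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies, one per rank

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. After n - 1 steps, rank r holds the
    # fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks first so all sends are simultaneous.
        sends = [((r - step) % n, data[r][span((r - step) % n)])
                 for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            data[dst][span(c)] = [x + y for x, y in
                                  zip(data[dst][span(c)], payload)]

    # Phase 2: all-gather. Each completed chunk travels around the
    # ring, overwriting the partial sums on the other ranks.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, data[r][span((r + 1 - step) % n)])
                 for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            data[(r + 1) % n][span(c)] = payload

    return data


# Hypothetical gradients held by three ranks before synchronization.
grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
result = ring_all_reduce(grads)
print(result[0])
```

Every rank ends with the same element-wise sum, and no rank can finish a step until its neighbour's chunk has arrived. That dependency is one way to see why a single dropped or delayed packet can stall an entire training step.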
Nicholas Davidson, a machine learning engineer at Cisco, then shared his experience of building a generative AI (GenAI) application using the on-premises GPU cluster managed by Paresh. He explained how the infrastructure allowed him to train models on Cisco’s private data, which could not be moved to the cloud due to security concerns. By leveraging the GPU cluster, Nicholas was able to reduce training times from days to hours, processing billions of tokens in a fraction of the time it would have taken using cloud-based resources. He also demonstrated how the AI model, integrated with Cisco’s Nexus Dashboard, could provide real-time insights and anomaly detection for network engineers, showcasing the practical benefits of having an on-prem AI/ML infrastructure.
Personnel: Paresh Gupta
Kickstart AI in Your Data Center with Cisco Validated Designs
In this presentation, Cisco outlines its approach to helping enterprises deploy AI infrastructure efficiently and effectively through Cisco Validated Designs (CVDs). The speakers, Siva Sivakumar and Tushar Patel, emphasize the growing importance of AI across industries and the challenges enterprises face in integrating AI into their existing IT infrastructure. Cisco’s solution is to provide a full-stack approach that simplifies the deployment of AI workloads, from training to fine-tuning and inferencing, using a combination of Cisco UCS servers, Nexus networking, and partnerships with key vendors like NVIDIA, Red Hat, NetApp, and Pure Storage. The goal is to eliminate the guesswork for enterprises by offering pre-validated, optimized designs that ensure high performance and scalability.
Cisco’s AI-ready infrastructure is built on a foundation of its Nexus network fabric and UCS servers, which are optimized for AI workloads. The company has developed a modular design that allows GPUs to be cycled independently of compute resources, providing flexibility and efficiency. Cisco also collaborates with partners like NVIDIA to integrate AI-specific software stacks, such as NVIDIA NGC and NIM, into its solutions. These validated designs are tailored for various AI use cases, including large language models (LLMs), computer vision, and Retrieval-Augmented Generation (RAG). Cisco’s CVDs are comprehensive, covering everything from hardware setup to software tuning, and are designed to be easily reproducible, reducing the time and complexity for enterprises to get started with AI.
The presentation also highlights Cisco’s commitment to continuous improvement and customer support. Cisco works closely with its partners to ensure that its solutions are up-to-date with the latest AI technologies and best practices. The company also offers advisory services to help customers navigate the complexities of AI deployment, from selecting the right models to optimizing infrastructure for specific workloads. Cisco’s long-term vision is to become a trusted advisor for enterprises on their AI journey, providing not just hardware and software but also the expertise and tools needed to ensure successful AI implementations.
Personnel: Siva Sivakumar, Tushar Patel