Cisco provides reference architectures for AI networking that scale from 96-GPU clusters to 32,000-GPU deployments. These designs, available on Cisco.com and Nvidia.com, are vendor-agnostic and support Nvidia, AMD, and Intel. The core goal is to simplify operations for customers: designs should stay easy to build at scale while preserving automation and end-to-end visibility. The Nexus Dashboard platform delivers this by streamlining the complex requirements of AI infrastructure.
The Nexus Dashboard simplifies AI networking management. Customers can quickly create AI fabrics, choosing between routed and VXLAN EVPN options, and best-practice configurations for lossless fabrics (QoS, ECN, and PFC) are applied automatically. Advanced features such as Dynamic Load Balancing (DLB) can be enabled in a few clicks. The platform helps discover and onboard switches into the AI fabric, organizes them into scalable units, and provides guardrails against misconfigurations. Customers can manage their AI clusters alongside traditional data center and storage fabrics from a unified dashboard that offers clear topology views and inventory details of switches, interfaces, and connected GPUs.
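To make the idea of a guardrail against misconfiguration concrete, here is a minimal Python sketch of a sanity check on lossless-queue thresholds. It is not Nexus Dashboard code or its API. The `QueueProfile` type, its field names, and the specific rules are assumptions made for illustration, and the model is simplified by placing the ECN and PFC thresholds on a single buffer scale. The rule it encodes is the commonly cited practice that ECN marking should begin before PFC pauses traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueProfile:
    """Hypothetical per-queue buffer thresholds, in kilobytes."""
    ecn_min_kb: int   # start marking packets with ECN
    ecn_max_kb: int   # mark all packets with ECN
    pfc_xoff_kb: int  # send a PFC pause frame
    pfc_xon_kb: int   # resume after a pause


def check_lossless_profile(p: QueueProfile) -> list[str]:
    """Return human-readable problems; an empty list means no issue found."""
    problems = []
    if p.ecn_min_kb >= p.ecn_max_kb:
        problems.append("ECN min threshold must be below ECN max threshold")
    if p.ecn_max_kb >= p.pfc_xoff_kb:
        # Assumed rule: congestion signaling should kick in before pausing.
        problems.append("ECN max threshold should sit below PFC xoff")
    if p.pfc_xon_kb >= p.pfc_xoff_kb:
        problems.append("PFC xon (resume) must be below PFC xoff (pause)")
    return problems


if __name__ == "__main__":
    # Illustrative values only, not recommended settings.
    profile = QueueProfile(ecn_min_kb=150, ecn_max_kb=5000,
                           pfc_xoff_kb=4000, pfc_xon_kb=3500)
    for problem in check_lossless_profile(profile):
        print("guardrail:", problem)
```

A check of this kind runs before configuration is pushed to the switches, which is the point of applying best-practice settings automatically instead of entering them by hand.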
Beyond setup and management, the Nexus Dashboard provides critical visibility into AI jobs and troubleshooting capabilities. Integrating with workload managers like Slurm enables users to monitor AI jobs and correlate network performance with GPU and NIC issues. The dashboard offers an “at a glance” view of AI resources, highlighting anomalies and advisories. Users can drill down into specific jobs to visualize resource utilization and pinpoint performance bottlenecks. Detailed analytics provide insights into Ethernet interface drops, CRC errors, and GPU-specific metrics, including temperature, utilization, and power. The platform generates job-specific topologies, identifies anomalies down to individual links and GPUs, and provides actionable insights for root-cause analysis and resolution. For customers seeking integration with multi-vendor environments or custom automation workflows, the Nexus Dashboard also provides a comprehensive set of APIs that complement Cisco’s broader AI Canvas for multi-domain orchestration.
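The correlation described above, from a job to the links and GPUs it touches, can be sketched in a few lines of Python. This is an illustrative model only: it does not call the Nexus Dashboard or Slurm APIs, and the `LinkStats` type, its fields, and the sample data are hypothetical. It assumes you already have the list of nodes a job runs on and per-interface counters for the links those nodes attach to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkStats:
    """Hypothetical per-interface counters for a GPU-facing link."""
    switch: str
    interface: str
    gpu_node: str    # node attached to this interface
    drops: int
    crc_errors: int


def suspect_links(job_nodes, links, max_drops=0, max_crc=0):
    """Return links used by the job whose counters exceed the limits,
    worst first (ranked by drops plus CRC errors)."""
    nodes = set(job_nodes)
    flagged = [
        link for link in links
        if link.gpu_node in nodes
        and (link.drops > max_drops or link.crc_errors > max_crc)
    ]
    return sorted(flagged, key=lambda l: l.drops + l.crc_errors, reverse=True)


if __name__ == "__main__":
    # Made-up sample data for illustration.
    links = [
        LinkStats("leaf1", "Eth1/1", "node-a", drops=0, crc_errors=0),
        LinkStats("leaf1", "Eth1/2", "node-b", drops=120, crc_errors=3),
        LinkStats("leaf2", "Eth1/1", "node-c", drops=5, crc_errors=0),
        LinkStats("leaf2", "Eth1/2", "node-d", drops=999, crc_errors=9),
    ]
    for link in suspect_links(["node-a", "node-b", "node-c"], links):
        print(link.switch, link.interface, link.gpu_node)
```

Filtering by the job's node list first is what turns fabric-wide counters into a job-specific view: a noisy link on a node outside the job (here `node-d`) is ignored when diagnosing that job.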
Personnel: Meghan Kachhi, Richard Licon