Meghan Kachhi, Alex Burger, and Dan Backman presented for Cisco Data Center Networking at AI Infrastructure Field Day 4
This presentation took place on January 28, 2026, from 8:00 to 9:30 AM PT.
Presenters: Alex Burger, Dan Backman, Meghan Kachhi
Cisco AI Cluster: Design, Automation, and Visibility for Unified and Consistent Operations
The era of AI is transforming data centers, making advanced networks more essential than ever. To meet the demands of dynamic and distributed workloads, Cisco delivers flexible, scalable, and high-performance networking solutions purpose-built for AI.
Join us at AI Infrastructure Field Day 4 as Cisco engineers unveil the latest in AI networking. Discover how the Cisco Nexus 9000 (N9000) systems and reference architectures enable seamless front-end, storage, and back-end AI deployments. Explore the unified, on-premises Nexus Dashboard for simplified, consistent management across environments. Learn about Nexus Hyperfabric—a turnkey, cloud-managed AI POD that accelerates deployment from months to weeks and boosts operational efficiency.
See it all in action with demos, and experience how Cisco’s integrated AI Networking solution empowers organizations to streamline operations, support modern AI and non-AI workloads, and drive innovation.
Don’t miss this exclusive technical deep dive into the future of AI networking with Cisco.
Cisco AI Cluster Design, Automation, and Visibility
Watch on YouTube
Watch on Vimeo
Cisco’s presentation on AI Cluster Design, Automation, and Visibility, led by Meghan Kachhi and Richard Licon, aims to simplify AI infrastructure and address the challenges of lengthy design and troubleshooting cycles for GPU clusters. The core focus is on enhancing cluster designs, automating deployments, and providing end-to-end visibility to protect a competitive edge. The session outlines Cisco’s reference architectures, key components for building AI clusters, and upcoming updates to its Nexus Dashboard platform, which is expected to streamline design, automation, and monitoring at scale. This comprehensive approach is crucial because the battle for AI success lies at the infrastructure layer, ensuring GPUs are not underutilized by network inefficiencies.
Cisco's AI networking strategy rests on three pillars. First, its systems feature custom Silicon One platforms with programmable pipelines that adapt quickly to evolving AI infrastructure demands, alongside a partnership with NVIDIA that brings NX-OS to NVIDIA Spectrum-X silicon for full-stack reference architecture compliance. Rigorously tested transceivers and mature NX-OS software, now optimized for AI workloads, complete the system offerings. Second, the operating model includes the Nexus Dashboard for on-premises management and Nexus Hyperfabric for a full-stack, cloud-managed solution, complemented by an API-first approach for seamless integration with existing customer automation frameworks. Third, extensive AI reference architectures serve as validated blueprints, spanning from enterprise-scale deployments (under 1,024 GPUs) to hyperscale cloud environments (1K-16K+ GPUs), providing detailed component lists and ensuring a consistent networking experience across GPU vendors such as NVIDIA and AMD as well as across storage solutions. An AI cluster is broadly defined to encompass front-end, storage, and back-end GPU-to-GPU networks, with a growing trend toward convergence enabled by high-speed Ethernet to unify operating models.
Designing an efficient AI back-end network requires a non-blocking architecture with a 1:1 oversubscription ratio, keeping every GPU within one hop of others for optimal communication. Cisco employs a “scalable unit” concept, enabling incremental expansion by repeating validated blocks while adjusting spine-layer connectivity to maintain high performance. For smaller-scale deployments, such as a 32-GPU university cluster, Cisco demonstrates how front-end, storage, and back-end networks can be converged onto fewer, high-density switches, simplifying infrastructure. A critical consideration for such converged environments is Cisco’s policy-based load balancing, an innovation leveraging Silicon One ASICs. This enables preferential treatment of critical traffic, such as GPU-to-GPU training, over storage or front-end traffic, ensuring AI jobs run with minimal latency and maximum GPU utilization, even when sharing network resources.
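The 1:1 non-blocking arithmetic behind a scalable unit can be sketched in a few lines. This is a generic leaf-spine sizing illustration with example port counts, not Cisco-validated figures:

```python
# Illustrative sizing for a non-blocking (1:1) leaf-spine back-end fabric.
# Port counts and GPU counts are generic examples, not Cisco specifications.

def size_scalable_unit(gpus: int, ports_per_switch: int = 64) -> dict:
    """Size one leaf-spine 'scalable unit' with a 1:1 subscription ratio.

    With 1:1, each leaf splits its ports evenly: half face GPUs (downlinks),
    half face spines (uplinks), so no link is oversubscribed.
    """
    down_per_leaf = ports_per_switch // 2       # GPU-facing ports per leaf
    leaves = -(-gpus // down_per_leaf)          # ceiling division
    uplinks_total = leaves * down_per_leaf      # uplink count equals downlink count
    spines = -(-uplinks_total // ports_per_switch)
    return {"leaves": leaves, "spines": spines,
            "gpu_ports": leaves * down_per_leaf}

# A 256-GPU unit built from 64-port switches needs 8 leaves and 4 spines.
print(size_scalable_unit(256))
# -> {'leaves': 8, 'spines': 4, 'gpu_ports': 256}
```

Growing the cluster then means repeating this unit and widening the spine layer, which is exactly the incremental-expansion pattern the scalable-unit concept describes.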
Personnel: Meghan Kachhi, Richard Licon
Cisco Reference Architectures for AI Networking with the Nexus Dashboard
Watch on YouTube
Watch on Vimeo
Cisco provides comprehensive reference architectures for AI networking, scalable from small 96-GPU clusters up to massive 32,000-GPU deployments. These designs, available on Cisco.com and Nvidia.com, are vendor-agnostic, supporting Nvidia, AMD, and Intel. The core focus is to simplify operations for customers, ensuring ease of design at scale while maintaining automation and end-to-end visibility. This is achieved through the Nexus Dashboard platform, which streamlines the complex requirements of AI infrastructure.
The Nexus Dashboard significantly simplifies AI networking management. It enables customers to quickly create AI fabrics, choosing between routed or VXLAN EVPN options, with best-practice configurations for lossless fabrics, including QoS, ECN, and PFC, automatically applied. The platform also enables easy activation of advanced features, such as Dynamic Load Balancing (DLB), with minimal clicks. It facilitates the discovery and onboarding of switches into the AI fabric, organizes them into scalable units, and provides guardrails against misconfigurations. Customers can manage their AI clusters seamlessly alongside traditional data center and storage fabrics, leveraging a unified dashboard that offers clear topology views and inventory details of switches, interfaces, and connected GPUs.
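The ECN piece of the lossless-fabric configuration works by probabilistically marking packets as a queue fills, so senders slow down before drops occur. A minimal sketch of the standard WRED-style marking curve follows; the thresholds are generic illustrations, not Nexus Dashboard defaults:

```python
# WRED-style ECN marking curve: no marking below min_th, certain marking at
# max_th, and a linear ramp in between. Threshold values are illustrative.

def ecn_mark_probability(queue_depth: float, min_th: float,
                         max_th: float, max_p: float = 1.0) -> float:
    """Probability of ECN-marking a packet at the given queue depth."""
    if queue_depth <= min_th:
        return 0.0                  # queue is shallow: no congestion signal
    if queue_depth >= max_th:
        return max_p                # queue is deep: always mark
    # Linear ramp between the two thresholds
    return max_p * (queue_depth - min_th) / (max_th - min_th)

# Example: with thresholds of 200 and 1000 cells, a queue depth of 600
# marks half of the packets.
print(ecn_mark_probability(600, 200, 1000))
# -> 0.5
```

PFC complements this by pausing a specific priority class outright when a threshold is crossed; ECN's gradual marking is what lets the fabric stay lossless without resorting to pauses for most congestion events.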
Beyond setup and management, the Nexus Dashboard provides critical visibility into AI jobs and troubleshooting capabilities. Integrating with workload managers like Slurm enables users to monitor AI jobs and correlate network performance with GPU and NIC issues. The dashboard offers an “at a glance” view of AI resources, highlighting anomalies and advisories. Users can drill down into specific jobs to visualize resource utilization and pinpoint performance bottlenecks. Detailed analytics provide insights into Ethernet interface drops, CRC errors, and GPU-specific metrics, including temperature, utilization, and power. The platform generates job-specific topologies, identifies anomalies down to individual links and GPUs, and provides actionable insights for root-cause analysis and resolution. For customers seeking integration with multi-vendor environments or custom automation workflows, the Nexus Dashboard also provides a comprehensive set of APIs that complement Cisco’s broader AI Canvas for multi-domain orchestration.
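The correlation described above, mapping a job's GPUs to the links they traverse and flagging links with elevated error counters, can be sketched as a toy function. All names, data shapes, and numbers here are illustrative and do not reflect the actual Nexus Dashboard API:

```python
# Toy correlation of an AI job's GPUs to suspect network links, similar in
# spirit to the dashboard's root-cause view. Data shapes are hypothetical.

def flag_job_links(job_gpus, gpu_to_link, link_errors, threshold=0):
    """Return links serving this job whose error counters exceed threshold."""
    suspect = {}
    for gpu in job_gpus:
        link = gpu_to_link.get(gpu)
        if link is not None and link_errors.get(link, 0) > threshold:
            suspect[link] = link_errors[link]
    return suspect

# Hypothetical inventory: which leaf port each GPU NIC plugs into, and
# per-link CRC error counts collected by telemetry.
gpu_to_link = {"gpu0": "leaf1/Eth1/1", "gpu1": "leaf1/Eth1/2"}
link_errors = {"leaf1/Eth1/1": 0, "leaf1/Eth1/2": 742}

print(flag_job_links(["gpu0", "gpu1"], gpu_to_link, link_errors))
# -> {'leaf1/Eth1/2': 742}
```

The value of doing this at the platform level is the join itself: the workload manager knows which GPUs a job used, and the network knows which links erred; only a system holding both can narrow a slow job to one bad cable.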
Personnel: Meghan Kachhi, Richard Licon
Building AI Pods with Nexus Hyperfabric from Cisco
Watch on YouTube
Watch on Vimeo
This presentation introduces Cisco Nexus Hyperfabric, a cloud-managed platform that simplifies the deployment and ongoing management of AI infrastructure. It addresses the growing need for repeatable, scalable, and operationally efficient networks specifically for enterprise AI clusters. Cisco emphasizes that while hyperscalers build immense AI factories, a significant and growing market exists for smaller, enterprise-level AI deployments, often below 256 nodes, which they term “AI Clusters for the Rest of Us.”
The shift to these smaller, on-premises AI clusters is driven by several factors: data that is growing in size and sensitivity (e.g., healthcare records, intellectual property) and is therefore undesirable to place in the cloud; a broader trend of workloads repatriating from the cloud; and the need for project- or application-specific infrastructure rather than shared, general-purpose IT. Because AI technology evolves rapidly, enterprises also prefer incremental build-outs over massive, infrequent investments, allowing them to adopt newer hardware generations more frequently. However, designing and deploying these dense, complex, lossless Ethernet networks is challenging and time-consuming for traditional network practitioners, often involving weeks of design, lengthy procurement, and meticulous cabling.
Cisco Nexus Hyperfabric addresses these challenges by delivering a Meraki-like SaaS experience for data center network deployment. It offers pre-designed, NVIDIA Enterprise Reference Architecture (ERA)-compliant templates for AI clusters that automate the generation of a complete bill of materials, including optics and cables. This drastically reduces design time and eliminates manual errors, accelerating the “time to first token” for AI projects. Hyperfabric also streamlines day-one operations with step-by-step cabling instructions and real-time validation via server-side agents, ensuring correct physical connectivity. Beyond deployment, it provides end-to-end network visibility, proactive monitoring of components such as optics, and integrates advanced Ethernet features, including lossless capabilities (PFC, ECN) and adaptive routing, to optimize performance for demanding AI workloads.
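The bill-of-materials arithmetic that Hyperfabric automates, counting switches, cables, and optics for a pod, can be illustrated with generic numbers. This sketch assumes a 1:1 leaf-spine design and example port counts, not Cisco SKUs or validated quantities:

```python
# Illustrative bill-of-materials math for a 1:1 leaf-spine AI pod.
# Switch port counts and the resulting quantities are generic examples.

def pod_bom(gpus: int, ports_per_switch: int = 64) -> dict:
    """Count switches, cables, and optics for a non-blocking back-end pod."""
    down_per_leaf = ports_per_switch // 2   # GPU-facing ports per leaf
    leaves = -(-gpus // down_per_leaf)      # ceiling division
    uplinks = leaves * down_per_leaf        # 1:1: uplinks match downlinks
    spines = -(-uplinks // ports_per_switch)
    cables = gpus + uplinks                 # GPU links plus leaf-spine links
    return {"leaves": leaves, "spines": spines,
            "cables": cables, "optics": 2 * cables}  # one optic per cable end

# A 128-GPU pod on 64-port switches: 4 leaves, 2 spines, 256 cables, 512 optics.
print(pod_bom(128))
# -> {'leaves': 4, 'spines': 2, 'cables': 256, 'optics': 512}
```

Even for a modest pod, the optic count runs into the hundreds, which is why automating this tally (and the matching cabling plan) removes a major source of ordering and wiring errors.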
Personnel: Alex Burger, Dan Backman