Tech Field Day

The Independent IT Influencer Event


Cisco AI Cluster Design, Automation, and Visibility



AI Infrastructure Field Day 4


This video is part of the appearance, “Cisco Data Center Networking Presents at AI Infrastructure Field Day”. It was recorded as part of AI Infrastructure Field Day 4 from 8:00–9:30 AM PT on January 28, 2026.


Watch on YouTube
Watch on Vimeo

Cisco’s presentation on AI Cluster Design, Automation, and Visibility, led by Meghan Kachhi and Richard Licon, aims to simplify AI infrastructure and shorten the lengthy design and troubleshooting cycles typical of GPU clusters. The core focus is on improving cluster designs, automating deployments, and providing end-to-end visibility to protect a competitive edge. The session outlines Cisco’s reference architectures, the key components for building AI clusters, and upcoming updates to its Nexus Dashboard platform, which is expected to streamline design, automation, and monitoring at scale. This comprehensive approach matters because the battle for AI success is fought at the infrastructure layer: the network must not leave expensive GPUs sitting idle.

Cisco’s AI networking strategy rests on three pillars. First, its systems feature custom Silicon One platforms with programmable pipelines that adapt quickly to evolving AI infrastructure demands, and a partnership with NVIDIA that brings NX-OS to NVIDIA Spectrum-X silicon for full-stack reference architecture compliance. Rigorously tested transceivers and mature NX-OS software, now optimized for AI workloads, complete the system offerings. Second, the operating model includes the Nexus Dashboard for on-premises management and Nexus Hyperfabric for a full-stack, cloud-managed solution, complemented by an API-first approach that enables seamless integration with customers’ existing automation frameworks. Third, extensive AI reference architectures serve as validated blueprints, spanning enterprise-scale deployments (under 1,024 GPUs) to hyperscale cloud environments (1K–16K+ GPUs), providing detailed component lists and ensuring a consistent networking experience across GPU vendors such as NVIDIA and AMD as well as storage solutions. An AI cluster is broadly defined to encompass front-end, storage, and backend GPU-to-GPU networks, with a growing trend toward convergence, enabled by high-speed Ethernet, that unifies operating models.

Designing an efficient AI backend network requires a non-blocking architecture that maintains a 1:1 subscription ratio, keeping every GPU within one hop of others for optimal communication. Cisco employs a “scalable unit” concept, enabling incremental expansion by repeating validated blocks while adjusting spine-layer connectivity to maintain high performance. For smaller-scale deployments, such as a 32-GPU university cluster, Cisco demonstrates how front-end, storage, and backend networks can be converged onto fewer, high-density switches, simplifying infrastructure. A critical consideration for such converged environments is Cisco’s policy-based load balancing, an innovation leveraging Silicon One ASICs. This enables preferential treatment of critical traffic, such as GPU-to-GPU training, over storage or front-end traffic, ensuring AI jobs run with minimal latency and maximum GPU utilization, even when sharing network resources.
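The non-blocking, 1:1 design described above can be sketched as simple arithmetic: each leaf switch splits its ports evenly between GPU-facing downlinks and spine-facing uplinks, so uplink bandwidth always matches downlink bandwidth. The sketch below illustrates that sizing logic; the 64-port switch default and the sizing formula are illustrative assumptions, not Cisco-published figures.

```python
# Sketch: sizing a two-tier, non-blocking (1:1) leaf-spine backend fabric
# for a GPU cluster, in the spirit of the "scalable unit" approach above.
# Assumption: fixed-port switches, half downlinks / half uplinks per leaf.
import math

def size_fabric(num_gpus: int, switch_ports: int = 64) -> dict:
    """Return leaf/spine counts for a 1:1 non-blocking leaf-spine fabric.

    Each leaf dedicates half its ports to GPUs (downlinks) and half to
    spines (uplinks), so uplink bandwidth equals downlink bandwidth and
    every GPU is one spine hop away from every other GPU.
    """
    down_per_leaf = switch_ports // 2                 # GPU-facing ports per leaf
    leaves = math.ceil(num_gpus / down_per_leaf)      # leaves needed for all GPUs
    # Total uplinks (leaves * down_per_leaf) must be absorbed by spine ports:
    spines = math.ceil(leaves * down_per_leaf / switch_ports)
    return {"leaves": leaves, "spines": spines,
            "max_gpus": leaves * down_per_leaf}

# A 1,024-GPU enterprise-scale cluster with 64-port switches:
print(size_fabric(1024))   # {'leaves': 32, 'spines': 16, 'max_gpus': 1024}
# A small 32-GPU cluster (like the university example) fits one leaf:
print(size_fabric(32))     # {'leaves': 1, 'spines': 1, 'max_gpus': 32}
```

Growing the cluster by one "scalable unit" at a time then amounts to adding leaves and recomputing the spine layer, which is why the validated-block approach keeps expansion incremental rather than a redesign.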

Personnel: Meghan Kachhi, Richard Licon


Copyright © 2026 · Genesis Framework · WordPress