Cisco Presents at Networking Field Day 39



Event: Networking Field Day 39

Presentation date: November 6, 2025, 10:30–12:00

Presenters: Arun Annavarapu, Paresh Gupta


Follow along on Twitter using the hashtag #NFD39.


Cisco AI Cluster Design and Operational Strategies


Watch on YouTube
Watch on Vimeo

Arun Annavarapu, Director of Product Management for Cisco’s Data Center Networking Group, opened the presentation by framing the massive industry shift toward AI. He noted that the evolution from LLMs to agentic AI and edge inferencing creates an AI continuum that places unprecedented demands on the underlying infrastructure. The network is the key component, tasked with supporting new scale-up, scale-out, and even scale-across fabrics that connect data centers across geographies. Annavarapu emphasized that the network is no longer just a pipe: it must be available, lossless, resilient, and secure. He stressed that network problems correlate directly with poor GPU utilization, making network reliability essential for protecting the significant financial investment in AI infrastructure.

Cisco’s strategy to meet these challenges is to provide a complete, end-to-end solution that spans from its custom silicon and optics to the hardware, software, and the operational model. A critical piece of this strategy is simplifying the operating model for these complex AI networks. This model is designed to provide easy day-zero provisioning, allowing operators to deploy entire AI fabrics with a few clicks rather than pages of configuration. This is complemented by deep day-two visibility through telemetry, analytics, and proactive remediation, all managed from a single pane of glass that provides a unified view across all fabric types.
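
To make the day-zero idea concrete, here is a minimal Python sketch of what an intent-based provisioning call might look like: a declarative fabric description posted to a management endpoint. The URL, payload fields, and token below are hypothetical illustrations, not the actual Nexus Dashboard or HyperFabric AI API.

    # Minimal sketch of intent-based day-zero provisioning (hypothetical API).
    # The endpoint, payload schema, and auth shown here are illustrative only,
    # not the actual Nexus Dashboard or HyperFabric AI interface.
    import json
    import urllib.request

    fabric_intent = {
        "name": "ai-backend-fabric-1",
        "topology": "rail-optimized",       # non-blocking leaf/spine per rail
        "gpus_per_server": 8,
        "servers": 64,
        "link_speed_gbps": 400,
        "qos": {"ecn": True, "pfc": True},  # lossless RoCE transport
    }

    req = urllib.request.Request(
        "https://dashboard.example.com/api/v1/fabrics",  # placeholder URL
        data=json.dumps(fabric_intent).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer <token>"},
        method="POST",
    )
    # urllib.request.urlopen(req)  # one call replaces pages of configuration

The point of the sketch is the shape of the interaction: the operator states the fabric's intent once, and the platform expands it into per-switch configuration.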

To deliver this operational model, Cisco offers two primary form factors. The first is the Nexus Dashboard, a unified, on-premises solution that allows customers to manage their own provisioning, security, and analytics for AI fabrics. The second is HyperFabric AI, a SaaS-based platform in which Cisco operates the management software itself, offering a more hands-off, cloud-driven experience. Annavarapu explained that both solutions can feed data into higher-level aggregation layers like AI Canvas and Splunk. These tools provide cross-product correlation and advanced analytics, enabling the faster troubleshooting and operational excellence required by the new age of AI.

Personnel: Arun Annavarapu

Introduction to Cisco AI Cluster Networking Design with Paresh Gupta


Watch on YouTube
Watch on Vimeo

Paresh Gupta, a principal engineer at Cisco focusing on AI infrastructure, began by outlining the diverse landscape of AI adoption, which spans from hyperscalers with hundreds of thousands of GPUs to enterprises just starting with a few hundred. He categorized these environments by scale—scale-up within a server, scale-out across servers, and scale-across between data centers—and by use case, such as foundational model training versus fine-tuning or inferencing. Gupta emphasized that the solutions for these different segments must vary, as the massive R&D budgets and custom software of a hyperscaler are not available to an enterprise, which needs a simpler, more turnkey solution.

Gupta then deconstructed the modern AI cluster, starting with the immense computational power of GPU servers, which can now generate 6.4 terabits of line-rate traffic per server. He detailed the multiple, distinct networks required, highlighting a recent shift in best practices: the front-end network and the storage network are now often converged. This change is driven by cost savings and the realization that front-end traffic is typically low, making it practical to share the high-bandwidth 400-gig fabric. This converged network is distinct from the inter-GPU backend network, which is dedicated solely to GPU-to-GPU communication for distributed jobs, as well as a separate management network and potentially a backend storage network for specific high-performance storage platforms.
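
The 6.4-terabit figure is straightforward per-NIC arithmetic. A quick sketch, assuming eight GPUs per server each paired with an 800 Gb/s backend NIC (the breakdown and the converged-fabric loads are assumptions for illustration; only the total comes from the talk):

    # Back-of-envelope check of per-server line-rate traffic.
    # Assumes 8 GPUs per server, each with a dedicated 800 Gb/s backend NIC;
    # the talk cites the 6.4 Tb/s total but not this exact breakdown.
    gpus_per_server = 8
    backend_nic_gbps = 800

    backend_tbps = gpus_per_server * backend_nic_gbps / 1000
    print(f"Backend (inter-GPU) line rate: {backend_tbps:.1f} Tb/s")  # 6.4 Tb/s

    # Converged front-end + storage on a shared 400G fabric: practical because
    # front-end traffic is typically a small fraction of storage traffic.
    frontend_gbps, storage_gbps = 25, 300          # assumed steady-state loads
    shared_link_gbps = 400
    util = (frontend_gbps + storage_gbps) / shared_link_gbps
    print(f"Converged link utilization: {util:.0%}")  # ~81%, fits on 400G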

Finally, Gupta presented a simplified, end-to-end traffic flow to illustrate the complete operational picture. A user request does not just hit a GPU; it first traverses a standard data center fabric, interacts with applications and centralized services like identity and billing, and only then reaches the AI cluster’s front-end network. From there, the GPU node may access high-performance storage, standard storage for logs, or organization-wide data. If the job is distributed, it also drives traffic onto the inter-GPU backend network. This complete flow, he explained, is crucial for understanding that solving AI networking challenges requires innovations at every point of entry and exit, not just in the inter-GPU backend.

Personnel: Paresh Gupta

Cisco AI Networking Cluster Operations Deep Dive


Watch on YouTube
Watch on Vimeo

Paresh Gupta’s deep dive on AI cluster operations focused on the extreme and unique challenges of high-performance backend networks. He explained that these networks, which primarily use RDMA over Converged Ethernet (RoCE), are exceptionally sensitive to both packet loss and network delay. Because RoCE is UDP-based, it lacks TCP’s native congestion control, meaning a single dropped packet can stall an entire collective communication operation, forcing a costly retransmission and wasting expensive GPU cycles. This problem is compounded by AI traffic patterns such as checkpointing, where all GPUs write to storage simultaneously, creating massive incast congestion. Gupta emphasized that in these environments, where every nanosecond of delay matters, traditional network designs and operational practices are no longer sufficient.
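
A rough sketch of the checkpointing incast problem: the fan-in ratio between what the GPUs offer and what the storage front door can absorb. All of the numbers below are assumed for illustration.

    # Why checkpointing creates incast: many senders converge on few receivers.
    # Cluster sizes and speeds below are assumptions for illustration.
    gpus_writing = 512            # all GPUs checkpoint simultaneously
    gpu_write_gbps = 100          # per-GPU checkpoint write rate (assumed)
    storage_ports = 32            # storage front-door ports (assumed)
    storage_port_gbps = 400

    offered = gpus_writing * gpu_write_gbps
    capacity = storage_ports * storage_port_gbps
    print(f"Offered load {offered/1000:.1f} Tb/s vs "
          f"storage capacity {capacity/1000:.1f} Tb/s "
          f"-> fan-in ratio {offered/capacity:.1f}:1")
    # A 4:1 fan-in means sustained buffering or PFC pauses; with UDP-based
    # RoCE, any loss stalls the collective until the data is retransmitted.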

Cisco’s strategy to solve these problems is built on prescriptive, end-to-end validated reference architectures, tested with NVIDIA, AMD, Intel Gaudi, and all major storage vendors. Gupta detailed the critical importance of a rail-optimized design, a non-blocking topology engineered to ensure single-hop connectivity between all GPUs within a scalable unit. This design minimizes latency by keeping traffic off the spine switches, but its performance depends entirely on correct physical cabling. He explained that these architectures are built on Cisco’s smart switches, which use Silicon One ASICs and are tuned with fine-grained thresholds for congestion-management mechanisms like ECN and PFC.
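
Mechanically, a rail-optimized plan pins GPU g of every server to leaf (rail) g, so any same-rail GPU pair is a single hop apart. A minimal sketch, with naming conventions invented for illustration:

    # Sketch of a rail-optimized cabling plan: GPU g of every server connects
    # to leaf switch g ("rail" g), so same-rail GPU pairs are one hop apart.
    # Server/leaf naming is illustrative, not Cisco's plan format.
    def rail_plan(servers: int, gpus_per_server: int) -> dict[str, str]:
        plan = {}
        for s in range(servers):
            for g in range(gpus_per_server):
                nic = f"server{s:03d}/gpu{g}"
                port = f"leaf{g}/port{s}"   # rail g, one port per server
                plan[nic] = port
        return plan

    plan = rail_plan(servers=32, gpus_per_server=8)
    print(plan["server000/gpu3"])  # -> leaf3/port0
    print(plan["server031/gpu3"])  # -> leaf3/port31: same rail, single hop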

The most critical innovations, however, are in operational simplicity, delivered via Nexus Dashboard and HyperFabric AI. These platforms automate and hide the underlying network complexity. Gupta highlighted the automated cabling-check feature: the system generates a precise cabling plan for the rail-optimized design and provides a task list to on-site technicians, and the management UI only shows a port as green when it is cabled to the exact correct port, solving the pervasive, performance-crippling problem of miscabling. This feature, which customers report reduced deployment time by 90%, is combined with job-scheduler integration to detect and flag performance-degrading anomalies, such as a single job being inefficiently spread across multiple scalable units.
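
The cabling check itself reduces to diffing observed neighbors (for example, learned via LLDP) against the generated plan and turning a port green only on an exact match. A toy sketch of that comparison, with the mechanics assumed rather than taken from Cisco’s implementation:

    # Sketch of the cabling check: diff observed neighbors against the
    # generated plan and mark each NIC green only on an exact match.
    # (Mechanics assumed; not Cisco's actual implementation.)
    def check_cabling(expected: dict[str, str], observed: dict[str, str]) -> None:
        for nic, want in expected.items():
            got = observed.get(nic)
            if got != want:
                print(f"{nic}: expected {want}, saw {got} -> MISCABLED")

    expected = {"server004/gpu2": "leaf2/port4", "server004/gpu3": "leaf3/port4"}
    observed = {"server004/gpu2": "leaf5/port4", "server004/gpu3": "leaf3/port4"}
    check_cabling(expected, observed)
    # server004/gpu2: expected leaf2/port4, saw leaf5/port4 -> MISCABLED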

Personnel: Paresh Gupta

Cisco AI Cluster Networking Operations DLB Demo with Paresh Gupta


Watch on YouTube
Watch on Vimeo

Paresh Gupta concluded the deep dive by focusing on the most complex challenge in AI networking: congestion and load balancing in the backend GPU-to-GPU fabric. He explained that while operational simplicity and cabling are critical, the primary performance bottleneck, even in non-oversubscribed networks, is the failure of traditional ECMP load balancing. Because ECMP hashes flows onto links pseudo-randomly, it creates severe traffic imbalances, where one link may be driven to 105% of its capacity while another runs at only 60% utilization. This non-uniform utilization, not a lack of total capacity, creates congestion, triggers performance-killing pause frames, and can lead to out-of-order packets, which are devastating for tightly coupled collective communication jobs.
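
A toy simulation shows why static ECMP struggles with a small number of elephant flows: random hash placement makes some links overflow while others stay far below capacity, even when total capacity is sufficient. Flow counts and sizes below are illustrative.

    # Toy simulation of why static ECMP misbehaves with few large flows:
    # flows hash randomly onto links, so some links pile up while others idle.
    # Flow counts and sizes are illustrative.
    import random
    random.seed(7)

    LINKS = 8
    flows = [400] * 16                     # 16 elephant flows of 400 Gb/s each
    load = [0] * LINKS
    for size in flows:
        load[random.randrange(LINKS)] += size   # stand-in for the 5-tuple hash

    capacity = 800                          # per-link capacity in Gb/s
    for i, l in enumerate(load):
        print(f"link{i}: {l/capacity:.0%} utilized")
    # Typical run: a couple of links past 100% (congestion, pause frames)
    # while others sit far below capacity, despite ample total bandwidth.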

To solve this, Cisco has developed advanced load-balancing techniques that move beyond simple ECMP. Gupta detailed a “flowlet” dynamic load balancing (DLB) approach, in which the switch detects inter-packet gaps to identify a flowlet and sends the next flowlet on the currently least-congested link. More importantly, he highlighted a fully validated joint reference architecture co-designed with NVIDIA. This solution combines Cisco’s per-packet DLB, running on its switches, with NVIDIA’s adaptive routing and direct data placement capabilities on the SuperNIC. The handshake between the switch and the NIC is auto-negotiated, and Gupta showed video benchmarks of a 64-GPU cluster where this method improved application-level bus bandwidth by 35–40% and virtually eliminated pause frames compared to standard ECMP.
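
A minimal sketch of the flowlet idea: if the gap since a flow’s last packet exceeds a threshold, the flow can safely be re-steered to the least-loaded link without reordering packets already in flight. The threshold and the load proxy below are assumptions for illustration, not switch internals.

    # Sketch of flowlet-based DLB: if the gap since a flow's last packet
    # exceeds a threshold, the flow can be moved to the least-loaded link
    # without reordering in-flight packets. Threshold value is illustrative.
    FLOWLET_GAP_US = 50

    class FlowletBalancer:
        def __init__(self, links: int):
            self.load = [0] * links          # queued bytes per link (a proxy;
            self.last_seen = {}              # real switches age these counters)

        def pick_link(self, flow: str, ts_us: float, pkt_bytes: int) -> int:
            last = self.last_seen.get(flow)
            if last and ts_us - last[0] < FLOWLET_GAP_US:
                link = last[1]               # same flowlet: keep link and order
            else:                            # gap seen: re-pick least loaded
                link = min(range(len(self.load)), key=self.load.__getitem__)
            self.last_seen[flow] = (ts_us, link)
            self.load[link] += pkt_bytes
            return link

    lb = FlowletBalancer(links=4)
    print(lb.pick_link("flowA", ts_us=0, pkt_bytes=9000))    # new flowlet -> 0
    print(lb.pick_link("flowA", ts_us=10, pkt_bytes=9000))   # same flowlet -> 0
    print(lb.pick_link("flowA", ts_us=200, pkt_bytes=9000))  # gap -> re-pick: 1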

This advanced capability, Gupta explained, is made possible by the P4-programmable architecture of Cisco’s Silicon One ASIC, which allows new features to be delivered without a multi-year hardware respin. He framed this as the foundational work that is now being standardized by the Ultra Ethernet Consortium (UEC), of which Cisco is a steering member. By productizing these next-generation transport features today, Cisco is able to provide a consistent, high-performance, and validated networking experience for any AI environment, offering enterprises a turnkey solution that rivals the performance of complex, custom-built hyperscaler networks.

Personnel: Paresh Gupta
