Fabrix.ai Demo – Building Agentic AI at scale for Production

Event: AI Infrastructure Field Day 4

Appearance: Fabrix.ai Presents at AI Infrastructure Field Day

Company: Fabrix.ai

Video Links:

Personnel: Rached Blili

Fabrix.ai is building agentic AI at scale for production, moving beyond proofs of concept to deliver robust solutions. In the video from the Fabrix.AI channel, Rached Blili demonstrated the Fabrix.ai platform, highlighting its agent catalog, where users can access and manage a variety of agents, both developed by Fabrix.ai and custom-built. The platform offers an AI Storyboard dashboard that provides a comprehensive view of AI operations, enabling agents to be organized into projects with distinct permissions and toolsets. A significant emphasis is placed on observability, including detailed AI cost tracking at both global and project levels, and visibility into individual “conversations” or agentic sessions. Uniquely, Fabrix.ai provides performance evaluation for agents, treating them as digital workers by monitoring their performance over time, identifying top and underperforming agents, and suggesting specific fixes, such as modifying system prompts, to continuously improve their efficacy.

The demonstration showcases two types of agents: autonomous and interactive. Autonomous agents operate in the background, triggered by events, alerts, or schedules, as exemplified by a Network Root Cause Analysis agent. This agent automatically diagnoses network failures, such as router configuration errors, by analyzing logs, incident data, and router configurations. It generates comprehensive reports detailing the root cause, impact assessment, and multiple remediation plans, which a remediation agent can then use for automated implementation and verification. For interactive use, Fabrix.ai’s copilot, Fabio, enables users to converse directly with agents to manage complex tasks, such as verifying VPNs or configuring Netflow in a lab network, significantly reducing manual intervention and saving time.

Delving into the underlying architecture, the presentation revealed that complex problems are tackled using multi-agent complexes, where an orchestrator agent calls specialized sub-agents, each handling a specific part of the problem with a sequestered context. This approach enhances individual agents’ capabilities while enabling detailed cost management, tracking token usage, time, and expenses, and capturing individual agent contributions within a hierarchical structure. A detailed example illustrated an application root-cause analysis in which the orchestrator agent systematically investigated incident details, application dependency maps, and even interpreted plain-English change requests from a ticketing system. The platform’s advanced context and tooling engines are critical to operating at scale, enabling mass operations across numerous devices in parallel and efficiently processing vast tool outputs by storing them in a context cache for later retrieval and analysis, ensuring effective, secure, and reliable agent deployment.
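The orchestration pattern described above can be sketched in a few lines of Python. This is a hypothetical illustration, not Fabrix.ai's implementation: an orchestrator fans a problem out to specialized sub-agents, each with its own sequestered context; bulky tool outputs are parked in a context cache and referenced by key rather than inlined, and per-agent token usage is captured for hierarchical cost accounting. All class and field names here are assumptions.

```python
import uuid

class ContextCache:
    """Stores bulky tool outputs; agents pass around a small reference key."""
    def __init__(self):
        self._store = {}

    def put(self, payload):
        key = str(uuid.uuid4())
        self._store[key] = payload
        return key

    def get(self, key):
        return self._store[key]

class SubAgent:
    def __init__(self, name, cache):
        self.name = name
        self.cache = cache
        self.context = []          # sequestered: never shared with siblings
        self.tokens_used = 0

    def run(self, task):
        self.context.append(task)
        self.tokens_used += len(task.split())   # crude token accounting
        big_output = f"{self.name} analyzed: {task}"
        return self.cache.put(big_output)       # return a cache key, not raw data

class Orchestrator:
    def __init__(self, sub_agents, cache):
        self.sub_agents = sub_agents
        self.cache = cache

    def investigate(self, incident):
        # Fan the incident out to each specialist, collecting a reference to
        # its output and its cost contribution for the hierarchical report.
        report = {}
        for agent in self.sub_agents:
            key = agent.run(incident)
            report[agent.name] = {"summary_ref": key, "tokens": agent.tokens_used}
        return report

cache = ContextCache()
agents = [SubAgent("log-analyzer", cache), SubAgent("config-auditor", cache)]
report = Orchestrator(agents, cache).investigate("router eth0 flapping")
```

The key design point is that sub-agents never pollute one another's context, and the orchestrator only ever sees small references plus cost metadata, which is what makes the approach workable at scale.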


Crossing the Production Gap to Agentic AI with Fabrix.ai

Event: AI Infrastructure Field Day 4

Appearance: Fabrix.ai Presents at AI Infrastructure Field Day

Company: Fabrix.ai

Video Links:

Personnel: Rached Blili

Fabrix.ai highlights the critical challenges in deploying agentic AI from prototype to production within large enterprises. Blili noted that while agents are quick to prototype, they frequently fail in real-world environments due to dynamic variables. These failures typically stem from issues in context management, such as handling large tool responses and maintaining “context purity,” as well as from operational challenges related to observability and infrastructure, including security and user rights. To overcome these hurdles, Fabrix.ai proposes three core principles: moving as much of the problem as possible to the tooling layer, rigorously curating the context fed to the Large Language Model (LLM), and implementing comprehensive operational controls that monitor for business outcomes rather than just technical errors.

Fabrix.ai’s solution is a middleware built on a “trifabric platform” comprising data, automation, and AI fabrics. This middleware features two primary functional components: the Context Engine and the Tooling and Connectivity Engine. The Context Engine focuses on delivering pure, relevant information to the LLM through intelligent caching of large datasets (making them addressable and providing profiles such as histograms) and sophisticated conversation compaction that tailors summaries to the current user goal, preserving critical information better than traditional summarization. The Tooling and Connectivity Engine serves as an abstraction layer that integrates various enterprise tools, including existing MCP servers and non-MCP tools. It allows tools to exchange data directly, bypassing the LLM and preventing token waste. This engine uses a low-code, YAML-based approach for tool definition and dynamic data discovery to automatically generate robust, specific tools for common enterprise workflows, thereby reducing the LLM’s burden and improving reliability.
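The low-code, YAML-based tool definition mentioned above might look something like the following sketch. The schema and field names are hypothetical illustrations of the idea, not Fabrix.ai's actual format:

```yaml
# Hypothetical tool definition; field names are assumptions, not Fabrix.ai's schema.
tool:
  name: get_interface_errors
  description: Fetch error counters for a device interface
  inputs:
    - name: device_ip
      type: string
      required: true
    - name: interface
      type: string
      required: true
  connector: snmp            # resolved by the Tooling and Connectivity Engine
  output:
    cache: true              # large responses go to the context cache
    profile: histogram       # summary profile surfaced to the LLM
```

The point of a declarative definition like this is that the engine, not the LLM, handles connectivity, schema normalization, and output handling, which is what keeps large tool responses out of the model's context.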

Beyond these core components, Fabrix.ai emphasizes advanced operational capabilities. Their platform incorporates qualitative analysis of agentic sessions, generating reports, identifying themes, and suggesting optimizations to improve agent performance over time, effectively placing agents on a “performance improvement plan” (PIP). This outcome-based evaluation contrasts with traditional metrics like token count or latency. Case studies demonstrated Fabrix.ai’s ability to handle queries across vast numbers of large documents, outperforming human teams in efficiency and consistency, and to correlate information across numerous heterogeneous systems without requiring a data lake, thanks to dynamic data discovery. The platform also includes essential spend management and cost controls, recognizing the risk that agents may incur high operational costs if not properly managed.
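The outcome-based evaluation described above can be reduced to a simple aggregation: score each agentic session on whether the business outcome was achieved, roll scores up per agent over time, and flag underperformers for review. The sketch below is a hypothetical illustration of that idea (field names and the 0.7 threshold are assumptions), not Fabrix.ai's scoring model.

```python
from collections import defaultdict

def evaluate_agents(sessions, threshold=0.7):
    """sessions: list of {'agent': str, 'outcome_achieved': bool} records."""
    tally = defaultdict(lambda: {"ok": 0, "total": 0})
    for s in sessions:
        t = tally[s["agent"]]
        t["total"] += 1
        t["ok"] += 1 if s["outcome_achieved"] else 0
    # Per-agent success rate on business outcomes, not tokens or latency.
    scores = {a: t["ok"] / t["total"] for a, t in tally.items()}
    # Agents below threshold land on the "performance improvement plan".
    underperformers = [a for a, sc in scores.items() if sc < threshold]
    return scores, underperformers

sessions = [
    {"agent": "rca-agent", "outcome_achieved": True},
    {"agent": "rca-agent", "outcome_achieved": True},
    {"agent": "vpn-agent", "outcome_achieved": False},
    {"agent": "vpn-agent", "outcome_achieved": True},
]
scores, pip = evaluate_agents(sessions)
```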


Build Reliable, Secure, and Performant Agents using Fabrix.AI AgentOps Platform

Event: AI Infrastructure Field Day 4

Appearance: Fabrix.ai Presents at AI Infrastructure Field Day

Company: Fabrix.ai

Video Links:

Personnel: Shailesh Manjrekar

Fabrix.AI addresses the evolving AI operations landscape with an AgentOps platform that builds reliable, secure, and high-performance agents. The company, formerly CloudFabrix, rebranded as Fabrix.AI in response to customer demand for agentic functionality, moving beyond traditional AIOps, which relies on manual remediation after correlation and root-cause analysis. This shift was motivated by real-world challenges, such as an 8-hour telco outage caused by inadvertent access control list changes, highlighting the need for autonomous or semi-autonomous remediation workflows powered by Large Language Models (LLMs). However, this transition introduces new complexities, including the non-deterministic nature of LLMs, context and data management at scale, and the challenge of connecting to diverse data sources, which can lead to issues such as hallucination and an “agentic value gap,” where experimental demos rarely translate to enterprise value.

Fabrix.AI’s solution centers on proprietary middleware that serves as a critical intermediary between AI agents/LLMs and various data sources. This middleware comprises two main components: the Context Engine and Universal Tooling. The Context Engine ensures “purity of context” by providing only curated, summarized data to the LLM, thereby preventing context corruption and reducing hallucination, while also maintaining state across interactions. The Universal Tooling dynamically connects to over 1,700 disparate data sources, including MCP-enabled endpoints, API-based systems, and raw or legacy data, by creating necessary wrappers and normalizing data schemas for LLM understanding, and can even dynamically generate tools by scraping public APIs. This approach allows the platform to integrate seamlessly with existing IT environments, offering a full-stack solution from data acquisition to automation.

The platform is purpose-built for real-time data environments, differentiating it from generic agentic frameworks that may not meet these requirements. It offers a “co-pilot” for conversational queries and an “Agent Studio” for building custom agents, supplementing its library of 50 out-of-the-box agents across AIOps, Observability, SecOps, and BizOps. Fabrix.AI emphasizes operationalizing agents through its AgentOps model, which incorporates trust via prompt templates and dynamic instructions, governance through FinOps models, security via a “least agency” principle, and comprehensive observability at the agentic layer with audit trails and real-time flow maps. By consolidating tools, reducing Mean Time to Resolution (MTTR) and alert noise, and enabling faster deployments, Fabrix.AI positions itself as a robust, enterprise-grade platform that complements and enhances existing observability and ITOM tools.


Resilient Wireless Networks for AI with Cisco Enterprise Networking

Event: AI Infrastructure Field Day 4

Appearance: Cisco Enterprise Networking Presents at AI Infrastructure Field Day

Company: Cisco

Video Links:

Personnel: Minse Kim

Minse Kim, Cisco’s wireless product manager, emphasized that the AI era is profoundly changing enterprise networking, extending beyond data centers to encompass “physical AI” applications in factories, medical facilities, and dynamic workspaces. He noted that surging demand for AI infrastructure components is also influencing customer buying cycles, with some customers proactively investing in Wi-Fi 7 now. A key insight is that while AI infrastructure is often perceived as data center-centric, the actual consumption and training of AI models, particularly for robotics and autonomous systems, rely heavily on high-performance, low-latency wireless connectivity, making Wi-Fi 6, 6E, and 7 crucial “last mile” technologies. Cisco’s Wi-Fi 7 access points are designed to meet these demands, offering multi-gigabit speeds and backhaul capabilities up to 20 Gbps per AP.

Addressing Wi-Fi’s traditional reliability-versus-speed trade-off, Cisco has developed Ultra-Reliable Wireless Backhaul (URWB) capabilities integrated into its Wi-Fi 7 APs. By dedicating a radio, URWB provides a stable, predictable, and low-latency “wired-like” connection, which is essential for critical applications like robotics that cannot tolerate the blips and jitters common in traditional Wi-Fi during client roaming. Beyond connectivity, Cisco Wi-Fi 7 APs also enhance spatial awareness and location services. Leveraging technologies such as 802.11mc (FTM) and Ultra-Wideband (UWB) with sensor fusion, these APs deliver sub-meter (e.g., one-foot) location accuracy and low latency, resolving long-standing problems in asset tracking and network operations, as demonstrated by real-time asset tracking in an office environment. This ability to accurately digitize the physical world is fundamental for AI analytics.

Furthermore, Cisco is integrating AI into network operations to simplify management and optimize performance. For instance, AI models leverage telemetry data from 35 million Cisco APs globally to intelligently manage firmware upgrades, learning from customer rollback decisions to improve future deployments. AI also enhances Radio Resource Management (RRM) by moving beyond simple rule-based engines to intelligently optimize RF configurations, leveraging historical interference patterns and dynamically adapting to environmental changes to maximize network efficiency and stability. Cisco is even introducing the concept of APs acting as “synthetic clients” to proactively collect network statistics and provide informed recommendations. This comprehensive AI-powered approach, delivering ultra-reliable, high-speed wireless, precise spatial awareness, and intelligent network automation, is not a future vision but a current reality, with thousands of customers already using Cisco’s AI-powered network solutions.


Smarter Switching for AI with Cisco Enterprise Networking

Event: AI Infrastructure Field Day 4

Appearance: Cisco Enterprise Networking Presents at AI Infrastructure Field Day

Company: Cisco

Video Links:

Personnel: Kenny Lei

The foundational goal of campus switching (providing connectivity to users and endpoints) remains unchanged, but the ecosystem it serves is undergoing rapid transformation driven by evolving applications and devices. Kenny Lei, a Technical Marketing Engineer at Cisco, highlighted the pervasive influence of AI tools like ChatGPT and GitHub Copilot, the surging adoption of Wi-Fi 7 for its increased bandwidth and user density, and the emerging security challenges posed by quantum computing. These trends necessitate a campus network capable of handling dramatically increased, often symmetric, data traffic, with higher performance, lower latency, and robust security.

To address these demands, Cisco has introduced its new “Smart Switch” series, featuring the Catalyst 9350 for access layers and the Catalyst 9610 for aggregation. The Catalyst 9350 offers high Power over Ethernet (90W) and 10Gbps copper ports, complemented by multiple 100Gbps uplinks, significantly reducing oversubscription and ensuring optimal performance for latency-sensitive AI applications. The modular Catalyst 9610, with up to 25 Terabits of performance and support for hundreds of 100Gbps ports (with future 400Gbps capabilities), serves as a high-capacity core. Both platforms are powered by Cisco Silicon One A6 ASICs, which use a virtual output queuing (VOQ) architecture to prevent head-of-line blocking and support up to seven queues for granular traffic prioritization. This intelligent design, coupled with a hybrid buffer memory system, ensures that latency-sensitive traffic is processed swiftly while bulk data transfers avoid packet drops even under congestion.
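The two queuing ideas named above can be illustrated with a toy model (not Cisco's Silicon One implementation): virtual output queuing keeps a separate queue per egress port at each ingress, so congestion toward one output never blocks traffic bound for another, and a strict-priority scheduler drains the most latency-sensitive queue first.

```python
from collections import deque

NUM_PRIORITIES = 7  # the text cites up to seven queues for traffic prioritization

class IngressPort:
    def __init__(self, egress_ports):
        # One queue per (egress, priority): the essence of VOQ.
        self.voq = {(e, p): deque()
                    for e in egress_ports for p in range(NUM_PRIORITIES)}

    def enqueue(self, packet, egress, priority):
        self.voq[(egress, priority)].append(packet)

    def dequeue_for(self, egress):
        # Strict priority: lowest priority number wins. A backed-up queue
        # for a different egress never blocks this one (no head-of-line block).
        for p in range(NUM_PRIORITIES):
            q = self.voq[(egress, p)]
            if q:
                return q.popleft()
        return None

port = IngressPort(egress_ports=["eth1", "eth2"])
port.enqueue("bulk-transfer", egress="eth1", priority=6)
port.enqueue("ai-latency-sensitive", egress="eth1", priority=0)
port.enqueue("storage", egress="eth2", priority=3)
```

In this toy model the latency-sensitive packet jumps ahead of the earlier bulk transfer on eth1, while eth2's storage traffic is served independently, which is the behavior the VOQ architecture guarantees under congestion.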

Cisco emphasizes that security is embedded in the network fabric, featuring Trust Anchor Modules (TAMs) for hardware and software integrity, IPsec/MACsec for secure transport, and a zero-trust model powered by Security Group Tags (SGTs) and the Identity Services Engine (ISE) for continuous authentication and policy enforcement. The new switches also enhance visibility and policy management through HCAM (a combination of TCAM and SRAM), enabling efficient NetFlows and ACLs while significantly reducing resource consumption. Furthermore, the enhanced CPU and memory on these smart switches allow for hosting AI workloads closer to the edge, fostering distributed intelligence and faster processing. Operational efficiency is boosted by innovations such as the eXpress Forwarding Software Upgrade (XFSU), which minimizes outage time during updates by separating the control and data planes and offloading critical processes. Cisco also integrates AI into network operations through an AI Assistant in the Meraki dashboard, streamlining day-zero, day-one, and day-N tasks from inventory management and troubleshooting to compliance checks, ensuring a high-performance, secure, and quantum-ready network infrastructure for the AI era.


Secure Routing for AI with Cisco Enterprise Networking

Event: AI Infrastructure Field Day 4

Appearance: Cisco Enterprise Networking Presents at AI Infrastructure Field Day

Company: Cisco

Video Links:

Personnel: Rahul Sagi

Secure Routing with Cisco Enterprise Networking tackles the increasing complexity, user experience demands, and security requirements of modern WAN networks, especially with the advent of AI branches. Rahul Sagi introduced Cisco Secure Routers, launching in 2025, designed to converge Cisco’s best-in-class networking with advanced security in a single product. This convergence is enabled by a new Secure Networking Processor (SNP) that delivers the high throughput and capacity essential for future AI applications. These routers offer comprehensive on-box security capabilities, including a full stack of hybrid mesh firewalls with IPS/IDS, URL filtering, and AMP Threat Grid, while also supporting cloud security options for direct Internet access (DIA) use cases.

The Secure Networking Processor, an ARM-based chip with Cisco IP, is central to these innovations, enabling inline cryptographic acceleration and a natively integrated Next-Generation Firewall (NGFW) stack for superior performance. Cisco highlights significant improvements, including up to three times the IPsec performance and high security efficacy, with threat protection throughput reaching up to 11 Gbps even with all security features enabled. Addressing the impending threat of quantum computing, the new portfolio integrates Post-Quantum Cryptography (PQC) algorithms, specifically ML-KEM for key exchanges in WAN transport (IPsec and MACsec) and quantum-resistant secure boot, ensuring networks are future-proofed against quantum attacks by 2030, a critical concern for sectors like public, healthcare, retail, and finance. The secure routers also boast improved power efficiency and increased WAN interface capacities, supporting up to 100 Gbps, to handle the escalating I/O demands of AI-driven environments. Furthermore, some platforms include a dedicated AI/ML engine for local inferencing to enhance network performance in future software releases, and native zero-trust principles are embedded throughout the system.
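Deployments of ML-KEM in transports like IPsec typically run it in hybrid mode alongside a classical exchange, so the session stays secure unless both algorithms are broken. The sketch below illustrates only the combiner concept; the input secrets are placeholders, and real ML-KEM and ECDH values would come from a cryptography library, not this snippet.

```python
import hashlib

def derive_hybrid_key(classical_secret: bytes, pqc_secret: bytes) -> bytes:
    """Concatenate-and-hash combiner, for illustration only.

    The derived key depends on BOTH shared secrets, so an attacker must
    break the classical exchange AND the ML-KEM exchange to recover it.
    """
    return hashlib.sha256(classical_secret + pqc_secret).digest()

# Placeholder secrets standing in for real ECDH / ML-KEM outputs.
key = derive_hybrid_key(b"ecdh-shared-secret", b"ml-kem-shared-secret")
```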

Beyond hardware, Cisco is leveraging AI to simplify WAN operations, offering “AI for networking” tools for administrators. This includes “Branch as Code” with Cisco Validated Designs and integration into CI/CD pipelines for automated, scalable deployments across hundreds of sites. The AI Assistant in management solutions such as Catalyst SD-WAN Manager and the Meraki dashboard streamlines configuration and troubleshooting. Specific AI-powered features include Predictive Path Recommendations, which analyze historical network behavior to suggest optimal transport paths for applications at specific times, and Bandwidth Forecasting, which helps predict and plan for circuit upgrades. Anomaly Detection continuously monitors network attributes such as round-trip time, jitter, and loss to proactively alert administrators to anomalous behavior, reducing troubleshooting time. These combined efforts aim to deliver AI-ready networking products, simplify WAN operations with intelligent tools, and reduce risk across all layers with robust, future-proof security controls.


Cisco Enterprise Networking Platform Approach

Event: AI Infrastructure Field Day 4

Appearance: Cisco Enterprise Networking Presents at AI Infrastructure Field Day

Company: Cisco

Video Links:

Personnel: Shai Silberman

Cisco is unifying its enterprise networking platforms (Meraki and Catalyst) to deliver a single, consistent user experience with common AI and data services, consistent APIs, and shared workflows across cloud, on-prem, and hybrid deployments. This unification began with the creation of a dedicated network platform team that brought together the Meraki and Catalyst groups to foster a “build once, deploy twice” philosophy. New Cisco hardware, including switches, wireless routers, and IoT equipment, now supports both cloud and on-premises management out of the box, allowing customers to choose their preferred management method without making purchasing decisions based on deployment. This approach ensures consistent outcomes and experiences by leveraging the same underlying engines and logic across all platforms.

The convergence journey also includes a unified hardware and licensing model, a “magnetic UI framework” for a common user experience across all Cisco products, and consistent APIs. These APIs enable common tasks, infrastructure as code, and robust integrations with third-party systems such as ServiceNow and Splunk, as exemplified by the API-driven setup of the Paris Olympics infrastructure. At the core is a common AI and data layer, powered by a single Cisco cloud and shared algorithms. This enables deployment of the AI Assistant chatbot on both the Meraki Dashboard (generally available) and the Catalyst Center (open beta), using the same backend to deliver identical experiences and use cases. Additionally, Cisco Workflows, a free low-code solution, is integrated into the Meraki interface, offering templates and horizontal integration across domains and even other vendor products via APIs.

Further advancing management capabilities, Cisco introduced “Global Overview,” a generally available cloud-based product designed for customers operating both cloud and on-premises infrastructures. Global Overview provides a single cloud experience to integrate multiple Meraki organizations and Catalyst Centers, offering consolidated network health visibility, unified inventory, and single sign-on for seamless cross-launching into specific management platforms. Complementing the AI Assistant, AI Canvas (currently in alpha) offers cross-domain collaboration and troubleshooting by integrating multiple data sources and third-party applications via natural language AI agents. Cisco’s AI is powered by a proprietary “deep networking model,” a purpose-built Large Language Model trained on Cisco’s extensive knowledge base, including TAC and CX insights, to deliver highly specific, accurate networking solutions without using customer data and to continuously learn from live telemetry. This innovative approach aims to accelerate root-cause analysis and provide automated remediation while maintaining a human-in-the-loop model to build customer trust.


Cisco Enterprise Networking Vision, Strategy, and Execution

Event: AI Infrastructure Field Day 4

Appearance: Cisco Enterprise Networking Presents at AI Infrastructure Field Day

Company: Cisco

Video Links:

Personnel: Kiran Ghodgaonkar

Cisco presents its enterprise networking vision and strategy, detailing how it is executed from a platform perspective, particularly in the context of the rapidly evolving AI era. Kiran Ghodgaonkar, who leads product marketing for Cisco’s Secure WAN portfolio, introduced the session and outlined how the company is adapting its familiar routing, switching, wireless, and management products. With over 40 years of history, Cisco has been at the forefront of innovation through previous disruptions, including the internet, mobile, and cloud eras, consistently focusing on connecting users to applications. The current AI era, however, necessitates a fundamental rethink of how networking products are built to adapt to evolving application and data consumption.

In this new landscape, Cisco observes three consistent themes among its customers: increasing complexity from diverse devices and disparate product stacks; significant IT hiring and budget constraints exacerbated by a skills gap in networking and security; and the challenge of deploying long-lived networking equipment in a fast-evolving AI environment. To address these concerns and build an AI-ready, secure network, Cisco’s strategy is founded on three key pillars. First, it focuses on simplifying operations through Agentic Ops to assist IT leaders. Second, the strategy emphasizes integrating security directly into the network, leveraging it as a primary line of defense against emerging threats such as deepfakes and data leakage, while also adhering to new standards such as NIST post-quantum cryptography. Finally, Cisco aims to develop scalable AI-optimized devices that can simultaneously handle networking and security functions with low latency for demanding AI workloads.

Building hardware for the AI era means a significant evolution in Cisco’s approach. This includes developing custom silicon to deliver high bandwidth, performance, post-quantum readiness, and integrated security, moving beyond the limitations of off-the-shelf solutions. Enhanced observability, including deep packet inspection, is also crucial. For its operating system, IOS XE, Cisco is focused on easier deployment and upgrades without downtime, deep observability, efficient container execution, and robust programmability to support secure API communication for telemetry and management tools. From a broader systems perspective, the company is prioritizing visibility, programmability, and the maintenance of an open, interoperable ecosystem. A critical consideration for these systems is power efficiency, acknowledging networking equipment’s energy consumption and the growing importance of sustainability and carbon footprint management globally.


Building AI Pods with Nexus Hyperfabric from Cisco

Event: AI Infrastructure Field Day 4

Appearance: Cisco Data Center Networking Presents at AI Infrastructure Field Day

Company: Cisco

Video Links:

Personnel: Alex Burger, Dan Backman

This presentation introduces Cisco Nexus Hyperfabric, a cloud-managed platform that simplifies the deployment and ongoing management of AI infrastructure. It addresses the growing need for repeatable, scalable, and operationally efficient networks specifically for enterprise AI clusters. Cisco emphasizes that while hyperscalers build immense AI factories, a significant and growing market exists for smaller, enterprise-level AI deployments, often below 256 nodes, which they term “AI Clusters for the Rest of Us.”

The shift to these smaller, on-premises AI clusters is driven by several factors: data that is increasingly large and sensitive (e.g., healthcare records, intellectual property), which makes cloud deployment undesirable; a trend of workloads returning from the cloud; and the need for project- or application-specific infrastructure rather than shared general-purpose IT. The rapidly evolving AI technology also means enterprises prefer incremental build-outs rather than massive, infrequent investments, allowing them to leverage newer generations of hardware more frequently. However, designing and deploying these dense, complex, lossless Ethernet networks is challenging and time-consuming for traditional network practitioners, often involving weeks of design, lengthy procurement, and meticulous cabling.

Cisco Nexus Hyperfabric addresses these challenges by delivering a Meraki-like SaaS experience for data center network deployment. It offers pre-designed, NVIDIA ERA-compliant templates for AI clusters that automate the generation of a complete bill of materials, including optics and cables. This drastically reduces design time and eliminates manual errors, accelerating the “time to first token” for AI projects. Hyperfabric also streamlines day-one operations with step-by-step cabling instructions and real-time validation via server-side agents, ensuring correct physical connectivity. Beyond deployment, it provides end-to-end network visibility, proactive monitoring of components such as optics, and integrates advanced Ethernet features, including lossless capabilities (PFC, ECN) and adaptive routing, to optimize performance for demanding AI workloads.


Cisco Reference Architectures for AI Networking with the Nexus Dashboard

Event: AI Infrastructure Field Day 4

Appearance: Cisco Data Center Networking Presents at AI Infrastructure Field Day

Company: Cisco

Video Links:

Personnel: Meghan Kachhi, Richard Licon

Cisco provides comprehensive reference architectures for AI networking, scalable from small 96-GPU clusters up to massive 32,000-GPU deployments. These designs, available on Cisco.com and Nvidia.com, are vendor-agnostic, supporting Nvidia, AMD, and Intel. The core focus is to simplify operations for customers, ensuring ease of design at scale while maintaining automation and end-to-end visibility. This is achieved through the Nexus Dashboard platform, which streamlines the complex requirements of AI infrastructure.

The Nexus Dashboard significantly simplifies AI networking management. It enables customers to quickly create AI fabrics, choosing between routed or VXLAN EVPN options, with best-practice configurations for lossless fabrics, including QoS, ECN, and PFC, automatically applied. The platform also enables easy activation of advanced features, such as Dynamic Load Balancing (DLB), with minimal clicks. It facilitates the discovery and onboarding of switches into the AI fabric, organizes them into scalable units, and provides guardrails against misconfigurations. Customers can manage their AI clusters seamlessly alongside traditional data center and storage fabrics, leveraging a unified dashboard that offers clear topology views and inventory details of switches, interfaces, and connected GPUs.

Beyond setup and management, the Nexus Dashboard provides critical visibility into AI jobs and troubleshooting capabilities. Integrating with workload managers like Slurm enables users to monitor AI jobs and correlate network performance with GPU and NIC issues. The dashboard offers an “at a glance” view of AI resources, highlighting anomalies and advisories. Users can drill down into specific jobs to visualize resource utilization and pinpoint performance bottlenecks. Detailed analytics provide insights into Ethernet interface drops, CRC errors, and GPU-specific metrics, including temperature, utilization, and power. The platform generates job-specific topologies, identifies anomalies down to individual links and GPUs, and provides actionable insights for root-cause analysis and resolution. For customers seeking integration with multi-vendor environments or custom automation workflows, the Nexus Dashboard also provides a comprehensive set of APIs that complement Cisco’s broader AI Canvas for multi-domain orchestration.


Cisco AI Cluster Design, Automation, and Visibility

Event: AI Infrastructure Field Day 4

Appearance: Cisco Data Center Networking Presents at AI Infrastructure Field Day

Company: Cisco

Video Links:

Personnel: Meghan Kachhi, Richard Licon

Cisco’s presentation on AI Cluster Design, Automation, and Visibility, led by Meghan Kachhi and Richard Licon, aims to simplify AI infrastructure and address the challenges of lengthy design and troubleshooting cycles for GPU clusters. The core focus is on enhancing cluster designs, automating deployments, and providing end-to-end visibility to protect a competitive edge. The session outlines Cisco’s reference architectures, key components for building AI clusters, and upcoming updates to its Nexus Dashboard platform, which is expected to streamline design, automation, and monitoring at scale. This comprehensive approach is crucial because the battle for AI success lies at the infrastructure layer, ensuring GPUs are not underutilized by network inefficiencies.

Cisco leverages three unique pillars in its AI networking strategy. Firstly, its systems feature custom Silicon One platforms with programmable pipelines that quickly adapt to evolving AI infrastructure demands, and a partnership with NVIDIA that provides NX-OS on NVIDIA Spectrum X silicon to ensure full-stack reference architecture compliance. Rigorously tested transceivers and a mature NX-OS software, now optimized for AI workloads, complete the system offerings. Secondly, the operating model includes the Nexus Dashboard for on-premises management and Nexus Hyperfabric for a full-stack, cloud-managed solution, complemented by an API-first approach to seamless integration with existing customer automation frameworks. Thirdly, extensive AI reference architectures serve as validated blueprints, spanning enterprise-scale deployments (under 1024 GPUs) to hyperscale cloud environments (1K-16K+ GPUs), providing detailed component lists and ensuring a consistent networking experience across vendors such as NVIDIA, AMD, and storage solutions. An AI cluster is broadly defined to encompass front-end, storage, and backend GPU-to-GPU networks, with a growing trend toward convergence enabled by high-speed Ethernet to unify operating models.

Designing an efficient AI backend network requires a non-blocking architecture that maintains a 1:1 oversubscription ratio, keeping every GPU within one switch hop of every other GPU for optimal communication. Cisco employs a “scalable unit” concept, enabling incremental expansion by repeating validated blocks while adjusting spine-layer connectivity to maintain high performance. For smaller-scale deployments, such as a 32-GPU university cluster, Cisco demonstrates how front-end, storage, and backend networks can be converged onto fewer, high-density switches, simplifying infrastructure. A critical consideration for such converged environments is Cisco’s policy-based load balancing, an innovation leveraging Silicon One ASICs. This enables preferential treatment of critical traffic, such as GPU-to-GPU training, over storage or front-end traffic, ensuring AI jobs run with minimal latency and maximum GPU utilization, even when sharing network resources.
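The arithmetic behind a non-blocking two-tier leaf/spine fabric can be sketched as follows. This is an illustrative calculation under simplifying assumptions (identical switches of one port count, single links between each leaf and spine), not a Cisco sizing tool:

```python
# Illustrative sizing for a non-blocking (1:1) two-tier leaf/spine fabric.
# Assumes identical switches with the given port count (radix); not a Cisco tool.

def fabric_capacity(radix: int) -> dict:
    """Maximum GPU count for a two-tier fabric at a 1:1 oversubscription ratio."""
    down = radix // 2   # leaf ports facing GPUs
    up = radix - down   # leaf uplinks toward spines (equal to 'down' for 1:1)
    max_leaves = radix  # each spine has 'radix' ports, one link per leaf
    return {
        "gpus_per_leaf": down,
        "max_spines": up,
        "max_leaves": max_leaves,
        "max_gpus": down * max_leaves,
    }

# With 64-port switches: 32 GPUs per leaf, up to 64 leaves, 2048 GPUs total.
print(fabric_capacity(64))
```

Growing beyond this ceiling is where the “scalable unit” idea comes in: repeat the validated leaf block and rework the spine layer rather than redesigning the fabric from scratch.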


Practical AI for Business Growth

Event: AI Field Day 7

Appearance: Utilizing AI Podcast

Company: The Futurum Group

Video Links:

Personnel: Nick Patience, Stephen Foskett

Stephen Foskett and Nick Patience are introducing a new podcast called Utilizing AI, where they focus on the practical applications of artificial intelligence in enterprises to drive efficiency, improve decision-making, and foster innovation. They invite listeners to participate in weekly episodes that discuss practical AI applications within various business sectors, offering insights and examples of real-world outcomes from incorporating AI technologies. The episodes will also explore AI Field Day events and analyze insights from The Futurum Group’s research and AI practice.

The podcast’s first episode was recorded live during AI Field Day in Santa Clara, bringing various groups together, including individuals from TechStrong, Tech Field Day, and Futurum’s research and analyst team. They aim to present different perspectives on how AI can be transformative within enterprise IT, presenting AI as a pervasive general-purpose technology impacting various facets of business operations. Highlights from the discussion included upcoming AI Field Day events, planned appearances at industry conferences, and an exploratory overview of significant players in the AI ecosystem, such as AWS, Google, and Oracle, and the roles they play. The podcast intends to focus on keeping up with the fast pace of AI developments and distinguishing meaningful trends and innovations from fleeting ones.

They shared insights into AI’s impact on industries and detailed upcoming plans and discussions related to AI usage. As AI continues to evolve and impact various industries differently, there’s an emphasis on enterprise adoption of AI technologies to automate processes, improve customer service, and optimize operations. The team plans to highlight practical examples and discuss how businesses can navigate the rapidly-changing AI landscape, balancing the fear of missing out with the day-to-day operational needs. The podcast is set to respond to weekly changes and announcements in the AI field, continuously aiming to inform and discuss the current state and future of AI technologies.


Considering ResOps – a Tech Field Day Roundtable at Commvault SHIFT 2025

Event: Tech Field Day Experience at Commvault SHIFT 2025

Appearance: Commvault SHIFT Roundtable Discussion

Company: Commvault

Video Links:

Personnel: Jay Cuthrell, Karen Lopez, Michael Stempf, Shala Warner, Stephen Foskett, Tom Hollingsworth

At the Commvault SHIFT 2025 Tech Field Day Roundtable in New York City, moderator Stephen Foskett convened a panel of industry experts to discuss the latest trends in data protection, resilience, and artificial intelligence. The panel included Jay Cuthrell, Karen Lopez, Shala Warner, and Tom Hollingsworth, as well as Michael Stempf from Commvault, each bringing perspectives from security, data management, DevOps, and cloud architecture. The discussion focused on Commvault’s strategic announcements around ResOps—an emerging discipline combining practices from DevOps, SecOps, and FinOps into a holistic approach to cyber resilience. Panelists noted the importance of cross-team collaboration, integrations with major cloud and security platforms, and the convergence of operational practices, all of which align with the increasing complexity of enterprise IT environments and the growing threat landscape fueled by AI-driven attacks.

A key topic was the shift from traditional disaster recovery (DR) and backup, which assumed non-malicious outages, towards a mindset anchored in active defense against adversarial threats like ransomware. Jay Cuthrell and Tom Hollingsworth highlighted innovations such as synthetic restore—a method to selectively recover clean data and minimize downtime after an attack—as well as the crucial role of identifying attack persistence in overlooked areas like Active Directory. The panel emphasized the necessity of incorporating AI for faster detection and remediation, but also pointed out the risk of AI-generated threats and the importance of comprehensive data inventories. Karen Lopez stressed that recovery, not just backup, should be the ultimate goal, asserting that organizations need robust strategies to know what data they have, where it lives, and how it is being protected.

The roundtable concluded that Commvault’s announced direction—moving beyond storage toward broader cyber and AI resilience—was credible and matched the realities of modern IT. Panelists praised new capabilities such as conversational interfaces and integrations with collaboration tools (e.g., Office 365, Google Workspace, and cloud-native databases), while also pointing to the need for organizations to invest in people and processes, not just technology. The panel agreed that cyber resiliency is now a “team sport,” requiring cooperation across IT, security, legal, and business units, facilitated by intelligent automation and education programs. The event served as both a showcase of Commvault’s evolution and a broader industry call to arms for holistic, AI-aware data protection.


What Would You Send a Cloud Scout to Fix with SOUTHWORKS

Event: Tech Field Day at KubeCon North America 2025

Appearance: Southworks Presents at Tech Field Day at KubeCon North America 2025

Company: SOUTHWORKS

Video Links:

Personnel: Johnny Halife

This segment grounds the idea in practice. We’ll examine how embedded engineers have helped product teams go beyond reactive fixes — from automating post-mortems to co-designing self-healing infrastructure and predictive testing frameworks. The focus is on what changes when teams own reliability together: faster iteration, fewer handoffs, and more precise success metrics. We’ll close with an open discussion on how organizations can experiment with the Cloud Scout model — and what it signals for the next evolution of DevOps.

The presentation addresses the challenge of organizations needing to adopt new technologies, such as AI, but facing uncertainty and risk. The Cloud Scout model is presented as a way to mitigate these risks by embedding engineers to assess the current state, identify opportunities, and demonstrate the value of new tools and practices. The goal is to de-risk innovation and empower teams to embrace change, particularly concerning AI adoption, which is driven by business mandates but often faces resistance due to security concerns or a lack of clear implementation strategies.

A key aspect of the Cloud Scout approach is its focus on practical application and measurable business outcomes. The scouts aim to demonstrate, not just tell, how AI can be utilized to achieve specific goals, such as reducing alert fatigue or enhancing efficiency. While the initial engagement is typically a 40-hour-a-week commitment for three months to understand the problem and prototype a solution, it can evolve into a fractional engagement with a specialist or lead to a separate project for building out the solution. This approach emphasizes the importance of senior expertise in navigating uncertainty and mitigating risk associated with new technology adoption, ultimately enabling organizations to become more mature and effectively embrace innovation.


Demonstrating AI-Assisted Development for Leading European Streaming Service with SOUTHWORKS

Event: Tech Field Day at KubeCon North America 2025

Appearance: Southworks Presents at Tech Field Day at KubeCon North America 2025

Company: SOUTHWORKS

Video Links:

Personnel: Johnny Halife

A Cloud Scout is a forward-deployed engineer who joins the product team to co-own reliability, scalability, and evolution. Drawing from the Forward-Deployed Engineer for SRE and AI-Managed DevCrew models, Scouts act as both architectural advisors and implementers, blending human judgment with AI-driven companions to build, test, and tune cloud-native systems. We walk through how this embedded approach fosters continuous improvement, strengthens technical decision-making, and creates a shared sense of accountability between Dev, Ops, and AI.

Johnny Halife from SOUTHWORKS presented an example of their work with a European streaming service facing issues with its electronic program guide (EPG). The EPG, built on Node.js, Lambda, S3, BigQuery, and XML, was showing viewers blank screens due to ingestion problems. The issue was traced to an unexpected HTTP 413 (request entity too large) error related to image transformation failures.

To address this, SOUTHWORKS employed a Cloud Scout, leveraging tools such as GitHub Copilot and their own MCP servers, which are connected to AWS CloudWatch. The process began with the scout prompting GitHub Copilot to create a Jira ticket, which was then assigned. The agent analyzed the error by running CloudWatch MCP, finding related logs, and contextualizing them within the solution codebase. This analysis revealed a missing validation and a data conflict between files, providing evidence-backed insights. The agent then proposed solutions, including code changes, which were compiled into a pull request.
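The “missing validation” the agent identified is the kind of guard sketched below. All names and the size limit are hypothetical, and the actual EPG service is Node.js; Python is used here purely for illustration:

```python
# Hypothetical pre-transformation guard against HTTP 413 (request entity too
# large). The 10 MB limit and function names are illustrative assumptions;
# the real service runs on Node.js and Lambda.

MAX_PAYLOAD_BYTES = 10 * 1024 * 1024  # assumed limit enforced upstream

def validate_image_payload(image_bytes: bytes) -> None:
    """Reject oversized images before the transformation step, so the failure
    surfaces as a clear validation error instead of a downstream 413."""
    if len(image_bytes) > MAX_PAYLOAD_BYTES:
        raise ValueError(
            f"image payload is {len(image_bytes)} bytes, exceeding the "
            f"{MAX_PAYLOAD_BYTES}-byte limit; resize or compress before upload"
        )

validate_image_payload(b"x" * 1024)  # a small payload passes silently
```

Catching the oversized payload at ingestion, rather than letting the transformation call fail, is what turns an opaque blank-screen symptom into an actionable, evidence-backed error.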

The final step involved a code review by the Scout, along with standard organizational pre- and post-requisites, including SonarQube and linting. This process, previously taking days, was reduced to a few hours. By implementing this AI-assisted approach, the streaming service experienced faster issue resolution, fewer noisy alerts, and predictive scoring for deployments, resulting in a significant reduction in recovery time. This approach enabled them to transition from a defensive strategy of increased monitoring and tooling to a proactive approach, aimed at preventing issues before they arise by analyzing past incidents and identifying potential risks.


Cloud Scouts are Embedded (Human) Builders for the Cloud Native Frontier with SOUTHWORKS

Event: Tech Field Day at KubeCon North America 2025

Appearance: Southworks Presents at Tech Field Day at KubeCon North America 2025

Company: SOUTHWORKS

Video Links:

Personnel: Johnny Halife

Most SRE teams were designed to ensure uptime, not to evolve products. They monitor systems but often lack the context to influence architecture or design. This segment examines why proximity — being part of the product team — is what transforms reliability into progress. When engineers operate as embedded partners, they surface deeper insights, close the gap between observation and action, and help the team own the outcome end to end.

Johnny Halife from SOUTHWORKS introduced Cloud Scouts, a new service designed to bridge the gap between SRE teams and product development teams. He explained how traditional SRE practices, while valuable for maintaining uptime and monitoring applications, often operate in silos, leading to “ticket battles” and a lack of context when issues arise. This SOUTHWORKS service addresses the evolving need for engineers who understand both the application and the platform, and can work directly with development teams to identify the root cause of problems, not just surface-level errors.

Cloud Scouts are senior software engineers who are embedded within customer teams, possessing domain expertise and the ability to quickly prototype solutions. They actively engage with software engineers, platform engineers, and architects to foster better communication and collaboration. These scouts also use AI-powered “companions” to analyze telemetry data, identify patterns, and propose fixes, while always maintaining human oversight to ensure accuracy and alignment with business goals. The service is intended to provide hands-on support rather than a consulting engagement.

The goal of Cloud Scouts is not to replace existing SRE or development teams, but to enhance their effectiveness by facilitating knowledge sharing, promoting end-to-end ownership, and accelerating the adoption of new technologies, such as AI. The engagement begins with a three-month assessment to evaluate the current state and establish a baseline, with the aim of achieving measurable improvements in areas such as alert fatigue, time to resolution, and overall system reliability. SOUTHWORKS emphasizes transparency and a collaborative approach, empowering customers to mature their practices and become more self-sufficient over time.


Cisco AI Networking Vision and Operational Strategies by Arun Annavarapu

Event: Networking Field Day 39

Appearance: Cisco Presents at Networking Field Day 39

Company: Cisco

Video Links:

Personnel: Arun Annavarapu

Arun Annavarapu, Director of Product Management for Cisco’s Data Center Networking Group, opened the presentation by framing the massive industry shift towards AI. He noted that the evolution from LLMs to agentic AI and edge inferencing creates an AI continuum that places unprecedented demands on the underlying infrastructure. The network is the key component, tasked with supporting new scale-up, scale-out, and even scale-across fabrics that connect data centers across geographies. Annavarapu emphasized that the network is no longer just a pipe. It must be available, lossless, resilient, and secure. He stressed that any network problems will directly correlate to poor GPU utilization, making network reliability essential for protecting the significant financial investment in AI infrastructure.

Cisco’s strategy to meet these challenges is to provide a complete, end-to-end solution that spans from its custom silicon and optics to the hardware, software, and the operational model. A critical piece of this strategy is simplifying the operating model for these complex AI networks. This model is designed to provide easy day-zero provisioning, allowing operators to deploy entire AI fabrics with a few clicks rather than pages of configuration. This is complemented by deep day-two visibility through telemetry, analytics, and proactive remediation, all managed from a single pane of glass that provides a unified view across all fabric types.

To deliver this operational model, Cisco offers two primary form factors. The first is the Nexus Dashboard, a unified, on-premises solution that allows customers to manage their own provisioning, security, and analytics for AI fabrics. The second option is HyperFabric AI, a SaaS-based platform where Cisco manages the management software, offering a more hands-off, cloud-driven experience. Annavarapu explained that both of these solutions can feed data into higher-level aggregation layers like AI Canvas and Splunk. These tools provide cross-product correlation and advanced analytics, enabling the faster troubleshooting and operational excellence required by the new age of AI.


Run Traefik Anywhere with Traefik Labs

Event: Tech Field Day at KubeCon North America 2025

Appearance: Traefik Labs Presents at Tech Field Day at KubeCon North America 2025

Company: Traefik Labs

Video Links:

Personnel: Sudeep Goswami

Sudeep Goswami highlighted Traefik’s ability to run in diverse environments, addressing the growing trend of multi-environment deployments. Traefik is not limited to Kubernetes; it can operate on Linux, within Docker containers, on HashiCorp Nomad, and across any certified Kubernetes distribution. Sudeep demonstrated the ease of installation using a consistent Helm chart across different environments, including AKS, EKS, GKE, and OKE, emphasizing that the same command applies universally.

The demonstration also included deploying Traefik at the edge, highlighting its default ingress status in K3s and seamless integration with platforms such as DigitalOcean Kubernetes, Canonical MicroK8s, and Linode’s LKE, provided they are CNCF-conformant distributions. Further, the ability to run Traefik offline in air-gapped mode was highlighted, requiring only a simple flag adjustment. This capability ensures complete isolation without telemetry or “call home” features, broadening Traefik’s applicability in highly secure environments.

In conclusion, the presentation highlighted three key takeaways: achieving operational leverage through unified application intelligence, gaining architectural control with decoupled AI runtime environments, and ensuring true deployment sovereignty by running Traefik anywhere. These points address the challenges posed by the coexistence of VMs, containers, and serverless architectures, advocating for a unified application routing layer that simplifies management and control across diverse environments. The ability to run Traefik anywhere is positioned as crucial for achieving true sovereignty, as it avoids vendor lock-in and allows organizations to move freely between different cloud providers and environments.


Accelerating AI In the Enterprise with Traefik Labs

Event: Tech Field Day at KubeCon North America 2025

Appearance: Traefik Labs Presents at Tech Field Day at KubeCon North America 2025

Company: Traefik Labs

Video Links:

Personnel: Sudeep Goswami

Sudeep Goswami from Traefik Labs presented at Tech Field Day at KubeCon North America 2025, focusing on accelerating AI in the enterprise using Traefik Labs’ runtime gateway. Given the rapid proliferation of AI models, decoupling applications from specific models becomes crucial to avoid constant refactoring. Traefik Labs advocates for decoupling at the gateway layer, providing operational freedom and leverage.

The gateway’s role extends beyond simple routing, encompassing critical functions like authentication, rate limiting, and implementing guardrails to ensure AI usage aligns with enterprise policies. These guardrails prevent misuse, such as finance agents answering legal questions. Caching at the gateway also optimizes token consumption. Sudeep emphasized that while such logic can be embedded in applications, production environments benefit from consolidating it at the gateway for scalability, performance, unified control, and observability.
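The guardrail-plus-caching logic described above can be sketched in a few lines. This is an illustrative sketch of gateway-layer policy enforcement, not Traefik’s actual API; the topic allow-list, agent names, and cache are assumptions for demonstration:

```python
# Illustrative gateway-layer guardrail and response cache. This is NOT
# Traefik's API; the policy table and names are hypothetical.
from functools import lru_cache

# Per-agent topic allow-list: the policy that keeps a finance agent from
# answering legal questions.
ALLOWED_TOPICS = {"finance": {"budgets", "invoices", "forecasts"}}

def guardrail(agent: str, topic: str) -> bool:
    """Return True only if the request falls within the agent's allowed topics."""
    return topic in ALLOWED_TOPICS.get(agent, set())

@lru_cache(maxsize=1024)
def cached_completion(prompt: str) -> str:
    """In a real gateway this would call the model; memoizing identical
    prompts is what saves token spend on repeated queries."""
    return f"response to: {prompt}"

print(guardrail("finance", "invoices"))      # allowed request
print(guardrail("finance", "contract-law"))  # blocked by the guardrail
```

Centralizing this logic at the gateway, rather than re-implementing it in every application, is what yields the unified control and observability Sudeep described.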

The presentation introduced the concept of a “triple gate pattern” for agentic workflows involving interactions with LLMs, MCP resources, and backend APIs. This necessitates AI gateways, MCP gateways, and traditional API gateways, ideally within a single binary to simplify deployment and management. Decoupling the API runtime from the model runtime is crucial, acknowledging the rapid evolution of AI models. Sudeep emphasized that no single model will ultimately dominate.


Run Your AI and APIs Anywhere with Traefik Labs

Event: Tech Field Day at KubeCon North America 2025

Appearance: Traefik Labs Presents at Tech Field Day at KubeCon North America 2025

Company: Traefik Labs

Video Links:

Personnel: Sudeep Goswami

Sudeep Goswami, CEO of Traefik Labs, began the presentation by introducing Traefik Labs and outlining three main topics: unified application intelligence; the acceleration of AI in the enterprise; and running applications anywhere, emphasizing freedom of choice across public cloud, edge, and air-gapped environments, all growing themes at KubeCon.

Traefik is one of the most downloaded API gateways, with over 3.4 billion downloads on Docker Hub. The open-source project, Traefik Proxy, has a vibrant community of more than 800 contributors, and its latest version, v3.6, features numerous enhancements driven by that community. Traefik is known for its intuitive interface, ease of use, and powerful capabilities, with a fully declarative infrastructure-as-code deployment model. The primary users and advocates are DevOps engineers, platform teams, and SREs, with increasing adoption by security and AIOps teams due to the agility and user experience it provides for AI workloads.

The core of Traefik Labs’ portfolio is Traefik Hub, which offers an open-source ingress controller, a licensed API gateway, and API management. Differentiators include excellent documentation, a Kube-native design, an intuitive UI, and a focus on day-two operations. The products are fully declarative, embracing the GitOps model and CI/CD pipelines to enable effective change management. Traefik’s pricing model is cluster-based or instance-based, unlike competitors that charge based on request volume, providing more predictable budgeting. The licensing also provides a safe zone for bursting without penalizing autoscaling.