What Would You Send a Cloud Scout to Fix with SOUTHWORKS

Event: Tech Field Day at KubeCon North America 2025

Appearance: Southworks Presents at Tech Field Day at KubeCon North America 2025

Company: SOUTHWORKS

Video Links:

Personnel: Johnny Halife

This segment grounds the idea in practice. We’ll examine how embedded engineers have helped product teams go beyond reactive fixes — from automating post-mortems to co-designing self-healing infrastructure and predictive testing frameworks. The focus is on what changes when teams own reliability together: faster iteration, fewer handoffs, and more precise success metrics. We’ll close with an open discussion on how organizations can experiment with the Cloud Scout model — and what it signals for the next evolution of DevOps.

The presentation addresses the challenge of organizations needing to adopt new technologies, such as AI, but facing uncertainty and risk. The Cloud Scout model is presented as a way to mitigate these risks by embedding engineers to assess the current state, identify opportunities, and demonstrate the value of new tools and practices. The goal is to de-risk innovation and empower teams to embrace change, particularly concerning AI adoption, which is driven by business mandates but often faces resistance due to security concerns or a lack of clear implementation strategies.

A key aspect of the Cloud Scout approach is its focus on practical application and measurable business outcomes. The scouts aim to demonstrate, not just tell, how AI can be utilized to achieve specific goals, such as reducing alert fatigue or enhancing efficiency. While the initial engagement is typically a 40-hour-a-week commitment for three months to understand the problem and prototype a solution, it can evolve into a fractional engagement with a specialist or lead to a separate project for building out the solution. This approach emphasizes the importance of senior expertise in navigating uncertainty and mitigating risk associated with new technology adoption, ultimately enabling organizations to become more mature and effectively embrace innovation.


Demonstrating AI-Assisted Development for Leading European Streaming Service with SOUTHWORKS

Event: Tech Field Day at KubeCon North America 2025

Appearance: Southworks Presents at Tech Field Day at KubeCon North America 2025

Company: SOUTHWORKS

Video Links:

Personnel: Johnny Halife

A Cloud Scout is a forward-deployed engineer who joins the product team to co-own reliability, scalability, and evolution. Drawing from the Forward-Deployed Engineer for SRE and AI-Managed DevCrew models, Scouts act as both architectural advisors and implementers — blending human judgment with AI-driven companions to build, test, and tune cloud-native systems. We walk through how this embedded approach fosters continuous improvement, strengthens technical decision-making, and creates a shared sense of accountability between Dev, Ops, and AI.

Johnny Halife from SOUTHWORKS presented an example of their work with a European streaming service facing issues with their electronic program guide (EPG). The EPG, built on Node.js, Lambda, S3, BigQuery, and XML, was experiencing blank displays due to ingestion problems. The issue was traced to an unexpected 413 error indicating that the request entity was too large, specifically related to image transformation failures. This problem was impacting viewers, who were seeing blank screens.

To address this, SOUTHWORKS employed a Cloud Scout, leveraging tools such as GitHub Copilot and their own MCP servers, which are connected to AWS CloudWatch. The process began with the scout prompting GitHub Copilot to create a Jira ticket, which was then assigned. The agent analyzed the error by running CloudWatch MCP, finding related logs, and contextualizing them within the solution codebase. This analysis revealed a missing validation and a data conflict between files, providing evidence-backed insights. The agent then proposed solutions, including code changes, which were compiled into a pull request.

The final step involved a code review by the Scout, along with standard organizational pre- and post-requisites, including SonarQube and linting. This process, previously taking days, was reduced to a few hours. By implementing this AI-assisted approach, the streaming service experienced faster issue resolution, fewer noisy alerts, and predictive scoring for deployments, resulting in a significant reduction in recovery time. This approach enabled them to transition from a defensive strategy of increased monitoring and tooling to a proactive approach, aimed at preventing issues before they arise by analyzing past incidents and identifying potential risks.


Cloud Scouts are Embedded (Human) Builders for the Cloud Native Frontier with SOUTHWORKS

Event: Tech Field Day at KubeCon North America 2025

Appearance: Southworks Presents at Tech Field Day at KubeCon North America 2025

Company: SOUTHWORKS

Video Links:

Personnel: Johnny Halife

Most SRE teams were designed to ensure uptime, not to evolve products. They monitor systems but often lack the context to influence architecture or design. This segment examines why proximity — being part of the product team — is what transforms reliability into progress. When engineers operate as embedded partners, they surface deeper insights, close the gap between observation and action, and help the team own the outcome end to end.

Johnny Halife from SOUTHWORKS introduced Cloud Scouts, a new service designed to bridge the gap between SRE teams and product development teams. He explained how traditional SRE practices, while valuable for maintaining uptime and monitoring applications, often operate in silos, leading to “ticket battles” and a lack of context when issues arise. This SOUTHWORKS service addresses the evolving need for engineers who understand both the application and the platform, and can work directly with development teams to identify the root cause of problems, not just surface-level errors.

Cloud Scouts are senior software engineers who are embedded within customer teams, possessing domain expertise and the ability to quickly prototype solutions. They actively engage with software engineers, platform engineers, and architects to foster better communication and collaboration. These scouts also use AI-powered “companions” to analyze telemetry data, identify patterns, and propose fixes, while always maintaining human oversight to ensure accuracy and alignment with business goals. The service is intended to provide hands-on support rather than a traditional consulting engagement.

The goal of Cloud Scouts is not to replace existing SRE or development teams, but to enhance their effectiveness by facilitating knowledge sharing, promoting end-to-end ownership, and accelerating the adoption of new technologies, such as AI. The engagement begins with a three-month assessment to evaluate the current state and establish a baseline, with the aim of achieving measurable improvements in areas such as alert fatigue, time to resolution, and overall system reliability. SOUTHWORKS emphasizes transparency and a collaborative approach, empowering customers to mature their practices and become more self-sufficient over time.


Cisco AI Networking Vision and Operational Strategies by Arun Annavarapu

Event: Networking Field Day 39

Appearance: Cisco Presents at Networking Field Day 39

Company: Cisco

Video Links:

Personnel: Arun Annavarapu

Arun Annavarapu, Director of Product Management for Cisco’s Data Center Networking Group, opened the presentation by framing the massive industry shift towards AI. He noted that the evolution from LLMs to agentic AI and edge inferencing creates an AI continuum that places unprecedented demands on the underlying infrastructure. The network is the key component, tasked with supporting new scale-up, scale-out, and even scale-across fabrics that connect data centers across geographies. Annavarapu emphasized that the network is no longer just a pipe. It must be available, lossless, resilient, and secure. He stressed that any network problems will directly correlate to poor GPU utilization, making network reliability essential for protecting the significant financial investment in AI infrastructure.

Cisco’s strategy to meet these challenges is to provide a complete, end-to-end solution that spans from its custom silicon and optics to the hardware, software, and the operational model. A critical piece of this strategy is simplifying the operating model for these complex AI networks. This model is designed to provide easy day-zero provisioning, allowing operators to deploy entire AI fabrics with a few clicks rather than pages of configuration. This is complemented by deep day-two visibility through telemetry, analytics, and proactive remediation, all managed from a single pane of glass that provides a unified view across all fabric types.

To deliver this operational model, Cisco offers two primary form factors. The first is the Nexus Dashboard, a unified, on-premises solution that allows customers to manage their own provisioning, security, and analytics for AI fabrics. The second option is HyperFabric AI, a SaaS-based platform where Cisco manages the management software, offering a more hands-off, cloud-driven experience. Annavarapu explained that both of these solutions can feed data into higher-level aggregation layers like AI Canvas and Splunk. These tools provide cross-product correlation and advanced analytics, enabling the faster troubleshooting and operational excellence required by the new age of AI.


Run Traefik Anywhere with Traefik Labs

Event: Tech Field Day at KubeCon North America 2025

Appearance: Traefik Labs Presents at Tech Field Day at KubeCon North America 2025

Company: Traefik Labs

Video Links:

Personnel: Sudeep Goswami

Sudeep Goswami highlights Traefik’s ability to run in diverse environments, addressing the growing trend of multi-environment deployments. Traefik is not limited to Kubernetes; it can also run on Linux, in Docker containers, on HashiCorp Nomad, and across any certified Kubernetes distribution. Sudeep demonstrated the ease of installation using a consistent Helm chart across different environments, including AKS, EKS, GKE, and OKE, emphasizing that the same command applies universally.
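
As a rough sketch of that “same command everywhere” point, the snippet below drives an identical Helm install of the Traefik chart against several clusters. The kube context names are placeholders, and chart values should be checked against the official Traefik Helm chart documentation rather than taken from this sketch.

```python
# Minimal sketch: run the same Helm install of Traefik against several clusters.
# Context names are placeholders; values would normally be tuned per environment.
import subprocess

CONTEXTS = ["aks-prod", "eks-prod", "gke-prod", "oke-prod"]  # hypothetical kube contexts

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Add and refresh Traefik's public chart repository once.
    run(["helm", "repo", "add", "traefik", "https://traefik.github.io/charts"])
    run(["helm", "repo", "update"])
    # The identical install command is reused for every cluster.
    for ctx in CONTEXTS:
        run([
            "helm", "upgrade", "--install", "traefik", "traefik/traefik",
            "--namespace", "traefik", "--create-namespace",
            "--kube-context", ctx,
        ])
```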

The demonstration also included deploying Traefik at the edge, highlighting its default ingress status in K3s and seamless integration with platforms such as DigitalOcean Kubernetes, Canonical MicroK8s, and Linode’s LKE, provided they are CNCF-compatible. Further, Traefik can run offline in air-gapped mode with only a simple flag adjustment. This capability ensures complete isolation without telemetry or “call home” features, broadening Traefik’s applicability in highly secure environments.

In conclusion, the presentation highlighted three key takeaways: achieving operational leverage through unified application intelligence, gaining architectural control with decoupled AI runtime environments, and ensuring true deployment sovereignty by running Traefik anywhere. These points address the challenges posed by the coexistence of VMs, containers, and serverless architectures, advocating for a unified application routing layer that simplifies management and control across diverse environments. The ability to run Traefik anywhere is positioned as crucial for achieving true sovereignty, as it avoids vendor lock-in and allows organizations to move freely between different cloud providers and environments.


Accelerating AI In the Enterprise with Traefik Labs

Event: Tech Field Day at KubeCon North America 2025

Appearance: Traefik Labs Presents at Tech Field Day at KubeCon North America 2025

Company: Traefik Labs

Video Links:

Personnel: Sudeep Goswami

Sudeep Goswami from Traefik Labs presented at Tech Field Day at KubeCon North America 2025, focusing on accelerating AI in the enterprise using Traefik Labs’ runtime gateway. Given the rapid proliferation of AI models, decoupling applications from specific models becomes crucial to avoid constant refactoring. Traefik Labs advocates for decoupling at the gateway layer, providing operational freedom and leverage.

The gateway’s role extends beyond simple routing, encompassing critical functions like authentication, rate limiting, and implementing guardrails to ensure AI usage aligns with enterprise policies. These guardrails prevent misuse, such as finance agents answering legal questions. Caching at the gateway also optimizes token consumption. Sudeep emphasized that while such logic can be embedded in applications, production environments benefit from consolidating it at the gateway for scalability, performance, unified control, and observability.
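
As a rough illustration (not Traefik’s implementation), the sketch below shows the kind of logic that consolidates at a gateway instead of living in each application: a per-client rate limit, a simple topic guardrail, and a prompt cache that saves tokens. All names and thresholds are invented.

```python
# Hypothetical sketch of gateway-side controls: rate limiting, a guardrail,
# and prompt caching in front of an LLM backend. Not Traefik's actual code.
import time
from collections import defaultdict

RATE_LIMIT = 5                      # requests per client per minute (illustrative)
BLOCKED_TOPICS = ("legal advice",)  # e.g. keep a finance agent out of legal questions
_request_log = defaultdict(list)    # client_id -> recent request timestamps
_cache = {}                         # prompt -> cached completion

def call_model(prompt: str) -> str:
    """Placeholder for the upstream model call."""
    return f"model answer for: {prompt}"

def handle(client_id: str, prompt: str) -> str:
    now = time.time()
    # Rate limiting: drop timestamps older than 60s, then check the budget.
    _request_log[client_id] = [t for t in _request_log[client_id] if now - t < 60]
    if len(_request_log[client_id]) >= RATE_LIMIT:
        return "429: rate limit exceeded"
    _request_log[client_id].append(now)

    # Guardrail: refuse prompts that stray into disallowed topics.
    if any(topic in prompt.lower() for topic in BLOCKED_TOPICS):
        return "403: request outside this agent's allowed scope"

    # Cache: avoid spending tokens on a prompt that was already answered.
    if prompt not in _cache:
        _cache[prompt] = call_model(prompt)
    return _cache[prompt]

if __name__ == "__main__":
    print(handle("finance-agent", "Summarize Q3 revenue"))
    print(handle("finance-agent", "Can you give me legal advice on this contract?"))
```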

The presentation introduced the concept of a “triple gate pattern” for agentic workflows involving interactions with LLMs, MCP resources, and backend APIs. This necessitates AI gateways, MCP gateways, and traditional API gateways, ideally within a single binary to simplify deployment and management. Decoupling the API runtime from the model runtime is crucial, acknowledging the rapid evolution of AI models. Sudeep emphasized that no single model will ultimately dominate.
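
One way to picture the single-binary triple gate is a single dispatcher in front of three handler chains. The path prefixes and handlers below are invented for illustration and are not Traefik Hub’s actual routing model.

```python
# Conceptual sketch of the "triple gate" pattern: one entry point dispatching
# to AI, MCP, and API handler chains. Path prefixes are illustrative only.
def ai_gateway(path: str) -> str:
    return f"AI gate (model routing, guardrails, token caching) -> {path}"

def mcp_gateway(path: str) -> str:
    return f"MCP gate (tool and resource access control) -> {path}"

def api_gateway(path: str) -> str:
    return f"API gate (authn, rate limits, routing) -> {path}"

GATES = {"/ai/": ai_gateway, "/mcp/": mcp_gateway, "/api/": api_gateway}

def dispatch(path: str) -> str:
    for prefix, gate in GATES.items():
        if path.startswith(prefix):
            return gate(path)
    return "404: no gate matched"

if __name__ == "__main__":
    for p in ("/ai/chat", "/mcp/tools/list", "/api/orders"):
        print(dispatch(p))
```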


Run Your AI and APIs Anywhere with Traefik Labs

Event: Tech Field Day at KubeCon North America 2025

Appearance: Traefik Labs Presents at Tech Field Day at KubeCon North America 2025

Company: Traefik Labs

Video Links:

Personnel: Sudeep Goswami

Sudeep Goswami, CEO of Traefik Labs, began the presentation by introducing Traefik Labs and outlining three main topics: unified application intelligence, the acceleration of AI in the enterprise, and running applications anywhere, with freedom of choice across public cloud, edge, and air-gapped environments. All three are growing themes at KubeCon.

Traefik is one of the most downloaded API gateways, with over 3.4 billion downloads on Docker Hub. The open-source project, Traefik Proxy, has a vibrant community of more than 800 contributors, and its latest version, v3.6, includes numerous enhancements driven by the community. Traefik is known for its intuitive interface, ease of use, and powerful capabilities, with a fully declarative infrastructure-as-code deployment model. The primary users and advocates are DevOps engineers, platform teams, and SREs, with increasing adoption by security and AIOps teams due to the agility and user experience it provides for AI workloads.

The core of Traefik Labs’ portfolio is Traefik Hub, which offers an open-source ingress controller, a licensed API gateway, and API management. Differentiators include excellent documentation, being Kube-native, an intuitive UI, and a focus on day two operations. They are fully declarative, embracing the GitOps model and CI/CD pipelines, enabling effective change management. Traefik’s pricing model is cluster-based or instance-based, unlike competitors that charge based on request volume, providing more predictable budgeting. The licensing also provides a safe zone for bursting without penalizing for autoscaling.


The Advantages of Running vSphere Kubernetes Service in a Private Cloud with VMware by Broadcom

Event: Tech Field Day at KubeCon North America 2025

Appearance: VMware by Broadcom Presents at Tech Field Day at KubeCon North America 2025

Company: VMware by Broadcom

Video Links:

Personnel: Jeremy Wolf

This session will emphasize the tangible benefits of running vSphere Kubernetes Service in a private cloud environment. We’ll reinforce how VKS simplifies operations within a private cloud context, highlighting minimal networking expertise required, self-service capabilities, and the powerful synergies with VCF services for an optimal private cloud solution.

Jeremy Wolf introduced VMware’s approach to bringing a cloud experience to private data centers with VCF (VMware Cloud Foundation). He emphasized that applications require more than just runtimes; they need a complete ecosystem. To that end, VCF offers three core runtime services out of the box: VKS (vSphere Kubernetes Service), VM Service (a cloud-like way to consume vSphere VMs), and Container Service (deploying containers without a complete Kubernetes cluster). The goal is to enable consumers, whether platform operators or app developers, to deploy workloads quickly and leverage the surrounding ecosystem, with a focus on extensibility, adaptability, and multi-tenancy.

The presentation elaborated on the architecture, illustrating how VCF provides a declarative API surface, the vSphere supervisor, to consume all resources through Kubernetes APIs. This enables users to leverage familiar tools, such as `kubectl`, and a new VCF CLI. The VCF CLI is designed to interact with resources using plugins, similar to those found in public cloud CLIs. A key benefit is that adding a new service to the ecosystem automatically makes it discoverable through the existing CLI or UI. The resources are like Lego blocks in a shared bucket: they can be picked up and combined to meet application and workload requirements.
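
Because the supervisor exposes everything through Kubernetes APIs, those resources can be consumed with standard tooling. The sketch below uses the Python Kubernetes client; the API group, version, and resource name are assumptions for illustration and should be verified against the CRDs actually installed on a given VCF release (for example with `kubectl api-resources`).

```python
# Minimal sketch: consume VCF supervisor resources through the Kubernetes API.
# Group/version/plural below are assumptions for illustration; verify against
# the CRDs installed on your supervisor.
from kubernetes import client, config

def list_supervisor_vms(namespace: str) -> None:
    config.load_kube_config()              # uses your current kube context
    api = client.CustomObjectsApi()
    vms = api.list_namespaced_custom_object(
        group="vmoperator.vmware.com",     # assumed VM Service API group
        version="v1alpha1",                # assumed version
        namespace=namespace,
        plural="virtualmachines",
    )
    for item in vms.get("items", []):
        print(item["metadata"]["name"])

if __name__ == "__main__":
    list_supervisor_vms("demo-namespace")  # placeholder namespace
```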

A demo showcased a three-tiered application (MySQL database in a VM, VKS cluster with front-end and back-end apps) deployed entirely through Argo CD and GitOps principles. All of the supporting services are used, including the secret service, volume service, network service, and the VM image service. The application is deployed simply by pasting the application YAML, and Argo CD does the rest. While acknowledging the inherent complexity in deploying diverse application form factors, Wolf clarified that the intent isn’t to mandate such complexity, but rather to provide the flexibility to address specific needs through a unified API service and a namespace construct for isolation. This highlights the benefits of discoverability through the same API.
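
For readers who want to picture the “paste the application YAML” step, here is a minimal sketch of an Argo CD Application manifest rendered from Python. The repository URL, path, and namespaces are placeholders rather than the demo’s actual values; only the field names follow Argo CD’s public Application spec.

```python
# Sketch: render a minimal Argo CD Application manifest for a GitOps deployment.
# repoURL, path, and namespaces are placeholders, not the demo's actual values.
import yaml

app = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "three-tier-demo", "namespace": "argocd"},
    "spec": {
        "project": "default",
        "source": {
            "repoURL": "https://example.com/org/three-tier-app.git",  # placeholder
            "targetRevision": "main",
            "path": "manifests",
        },
        "destination": {
            "server": "https://kubernetes.default.svc",
            "namespace": "three-tier-demo",
        },
        "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
    },
}

print(yaml.safe_dump(app, sort_keys=False))
```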


Deploying Your Apps in vSphere Kubernetes Service is Easy with VMware by Broadcom

Event: Tech Field Day at KubeCon North America 2025

Appearance: VMware by Broadcom Presents at Tech Field Day at KubeCon North America 2025

Company: VMware by Broadcom

Video Links:

Personnel: William Arroyo

This session showcases vSphere Kubernetes Service’s compatibility with existing setups, its Kubernetes-first API, and a rich ecosystem of Out-Of-The-Box (OOTB) services, such as ArgoCD. See a comprehensive, automated workflow in action, from provisioning a new VKS cluster with Terraform to deploying a sample microservices application via ArgoCD, and viewing metrics with integrated Prometheus and Telegraf. We’ll also show how platform engineers utilize Istio to connect, secure, control, and observe services across different infrastructures in VKS deployments.

VMware by Broadcom’s presentation at KubeCon North America 2025, delivered by William Arroyo, highlighted the ease of deploying applications in vSphere Kubernetes Service (VKS). The core of the presentation focused on demonstrating a streamlined workflow. This workflow begins with the platform engineer using Terraform to provision a new VKS cluster and subsequently deploying an example microservices application via ArgoCD. The session incorporated out-of-the-box (OOTB) services, such as ArgoCD, and integrated Prometheus and Telegraf to display metrics. The session also explained how Istio is utilized by platform engineers to connect, secure, control, and observe services across different infrastructures in VKS deployments.

The presentation walked through a complete workflow, combining the roles of an app developer and a platform engineer. The speaker utilized several components, like VKS, cloud services such as cert-manager, and Argo CD. These services were used to bootstrap the cluster with essential elements and applications, including Istio. Crucially, the speaker emphasized the use of APIs accessible via tools like `kubectl`, Terraform, and GitOps, allowing platform engineers flexibility. The demo showcased a Terraform apply command that created a supervisor namespace, an Argo CD instance, and a Kubernetes cluster.

In essence, the presentation focused on how VKS streamlines the entire lifecycle, making it easy to deploy, manage, and observe applications. Through the demonstration of Terraform and Argo CD integration, the speaker underscored the platform’s Kubernetes-first API approach and its compatibility with established tools and workflows. Further, the presentation highlighted the integration of fleet management features and cloud services. The session focused on the core components and the flexibility and power of the VMware by Broadcom offering.


vSphere Kubernetes Service for Platform Engineers – What You Didn’t Know with VMware by Broadcom

Event: Tech Field Day at KubeCon North America 2025

Appearance: VMware by Broadcom Presents at Tech Field Day at KubeCon North America 2025

Company: VMware by Broadcom

Video Links:

Personnel: Timmy Carr

This session will highlight the robust capabilities of VKS for platform engineers, focusing on deep technical insights and real-world value. Discover how VKS offers a fully conformant Kubernetes runtime, simplified lifecycle management with 2-click cluster deployment, and built-in guardrails to ensure stability. We’ll address common pain points and demonstrate how VKS directly solves them, leading to faster development cycles and reduced operational overhead.

The VMware by Broadcom presentation at Tech Field Day at KubeCon North America 2025 focused on vSphere Kubernetes Service (VKS) within the VMware Cloud Foundation (VCF) framework. VKS delivers fully conformant, certified Kubernetes in a user’s data center, ensuring that applications running in other clouds can seamlessly operate on VKS. VCF acts as a cloud within the data center, offering services such as virtual machines, containers, and Kubernetes to end users. The presentation highlighted the role of the cloud administrator in enabling platform engineers to deliver Kubernetes through VKS.

VKS integrates deeply with the VCF infrastructure, managing dependencies for Kubernetes cluster upgrades and offering seamless integration with GPU, storage, and compute resources. VMware emphasizes the rapid delivery of vSphere Kubernetes Releases (VKRs), aiming to release new versions within two months of the upstream Kubernetes community’s releases. The speaker addressed a question about the “service” designation in VKS, clarifying that it refers to a Kubernetes service running on a broader cloud platform with capabilities such as network isolation, object storage, and managed databases. VKS supports multi-cluster management, allowing users to manage and lifecycle their Kubernetes clusters, introspect workloads, manage security policies, and protect data via Velero.

The presentation clearly distinguished VKS and VCF from Tanzu, explaining that Tanzu is a developer-focused platform for delivering code to production, potentially running on top of VCF but not included by default. In a demo, it was highlighted how VKS clusters could be deployed quickly by users via a self-service portal that leveraged upstream Cluster API, and that every aspect of Kubernetes cluster infrastructure could be customized and versioned for declarative management via tools like ArgoCD. The presenter emphasized that VCF delivers these services in a consumable fashion for end users, transforming the traditional virtualization platform into a self-service cloud.


Cisco AI Cluster Networking Operations DLB Demo with Paresh Gupta

Event: Networking Field Day 39

Appearance: Cisco Presents at Networking Field Day 39

Company: Cisco

Video Links:

Personnel: Paresh Gupta

Paresh Gupta concluded the deep dive by focusing on the most complex challenge in AI networking: congestion and load balancing in the backend GPU-to-GPU fabric. He explained that while operational simplicity and cabling are critical, the primary performance bottleneck, even in non-oversubscribed networks, is the failure of traditional ECMP load balancing. Because ECMP hashes flows randomly, it creates severe traffic imbalances, where one link may be congested at 105% of its capacity while another runs at only 60%. This non-uniform utilization, not a lack of total capacity, creates congestion, triggers performance-killing pause frames, and can lead to out-of-order packets, which are devastating for tightly coupled collective communication jobs.
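
A quick simulation makes the imbalance intuitive: even when the total offered load exactly fits the links, hashing whole flows onto links leaves some links hot and others underused. The numbers below are illustrative and not drawn from Cisco’s testing.

```python
# Illustrative simulation: a few large ("elephant") flows hashed onto equal-cost
# links can overload one link while others are underused.
import random

LINKS = 4
FLOWS = 8            # few, large flows, typical of GPU collective traffic
FLOW_RATE = 0.5      # each flow offers 50% of a single link's capacity

def ecmp_utilization(seed: int) -> list[float]:
    random.seed(seed)
    load = [0.0] * LINKS
    for _ in range(FLOWS):
        load[random.randrange(LINKS)] += FLOW_RATE   # the hash pins the whole flow to one link
    return load

if __name__ == "__main__":
    for seed in range(3):
        print([f"{u:.0%}" for u in ecmp_utilization(seed)])
    # Typical output: some links at or above 100% while others sit far below,
    # even though the total offered load (8 * 0.5 = 4 links' worth) exactly fits.
```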

To solve this, Cisco has developed advanced load-balancing techniques, moving beyond simple ECMP. Gupta detailed a “flowlet” dynamic load balancing (DLB) approach, where the switch detects inter-packet gaps to identify a flowlet and sends the next flowlet on the current, least-congested link. More importantly, he highlighted a fully validated, joint-reference architecture codesigned with NVIDIA. This solution combines Cisco’s per-packet DLB, running on its switches, with NVIDIA’s adaptive routing and direct data placement capabilities on the SuperNIC. This handshake between the switch and the NIC is auto-negotiated, and Gupta showed video benchmarks of a 64-GPU cluster where this method improved application-level bus bandwidth by 35-40% and virtually eliminated pause frames compared to standard ECMP.
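
The flowlet mechanism itself is simple to sketch: when the gap since a flow’s last packet exceeds a threshold, the next burst can be steered to the currently least-loaded link without reordering packets already in flight. This is only a conceptual illustration, not the Silicon One implementation, and the threshold is an invented value.

```python
# Conceptual sketch of flowlet-based dynamic load balancing: a long enough
# inter-packet gap starts a new flowlet, which may move to the least-loaded link.
FLOWLET_GAP = 50e-6                      # 50 microseconds, illustrative threshold

class FlowletBalancer:
    def __init__(self, num_links: int):
        self.link_load = [0.0] * num_links   # e.g. bytes recently sent per link
        self.last_seen = {}                  # flow_id -> timestamp of last packet
        self.assigned = {}                   # flow_id -> currently assigned link

    def pick_link(self, flow_id: str, now: float, pkt_bytes: int) -> int:
        gap = now - self.last_seen.get(flow_id, -1.0)
        if flow_id not in self.assigned or gap > FLOWLET_GAP:
            # New flowlet: choose the least congested link right now.
            self.assigned[flow_id] = min(range(len(self.link_load)),
                                         key=self.link_load.__getitem__)
        self.last_seen[flow_id] = now
        link = self.assigned[flow_id]
        self.link_load[link] += pkt_bytes
        return link

if __name__ == "__main__":
    lb = FlowletBalancer(num_links=4)
    t = 0.0
    for i in range(10):
        t += 100e-6 if i == 5 else 1e-6      # one long gap starts a second flowlet
        print(lb.pick_link("flow-A", t, 4096))
```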

This advanced capability, Gupta explained, is made possible by the P4-programmable architecture of Cisco’s Silicon One ASIC, which allows new features to be delivered without a multi-year hardware respin. He framed this as the foundational work that is now being standardized by the Ultra Ethernet Consortium (UEC), of which Cisco is a steering member. By productizing these next-generation transport features today, Cisco is able to provide a consistent, high-performance, and validated networking experience for any AI environment, offering enterprises a turnkey solution that rivals the performance of complex, custom-built hyperscaler networks.


Cisco AI Networking Cluster Operations Deep Dive

Event: Networking Field Day 39

Appearance: Cisco Presents at Networking Field Day 39

Company: Cisco

Video Links:

Personnel: Paresh Gupta

Paresh Gupta’s deep dive on AI cluster operations focused on the extreme and unique challenges of high-performance backend networks. He explained that these networks, which primarily use RDMA over Converged Ethernet (RoCE), are exceptionally sensitive to both packet loss and network delay. Because RoCE is UDP-based, it lacks TCP’s native congestion control, meaning a single dropped packet can stall an entire collective communication operation, forcing a costly retransmission and wasting expensive GPU cycles. This problem is compounded by AI traffic patterns, such as checkpointing, where all GPUs write to storage simultaneously, creating massive incast congestion. Gupta emphasized that in these environments, where every nanosecond of delay matters, traditional network designs and operational practices are no longer sufficient.

Cisco’s strategy to solve these problems is built on prescriptive, end-to-end validated reference architectures, which are tested with NVIDIA, AMD, Intel Gaudi, and all major storage vendors. Gupta detailed the critical importance of a specific Rail-Optimized Design, a non-blocking topology engineered to ensure single-hop connectivity between all GPUs within a scalable unit. This design minimizes latency by keeping traffic off the spine switches, but its performance is entirely dependent on perfect physical cabling. He explained that these architectures are built on Cisco’s smart switches, which use Silicon One ASICs and are optimized with fine-tuned thresholds for congestion-notification protocols like ECN and PFC.

The most critical innovations, however, are in operational simplicity, delivered via Nexus Dashboard and HyperFabric AI. These platforms automate and hide the underlying network complexity. Gupta highlighted the automated cabling check feature. The system generates a precise cabling plan for the rail-optimized design and provides a task list to on-site technicians; the management UI will only show a port as green when it is cabled to the exact correct port, solving the pervasive and performance-crippling problem of miscabling. This feature, which customers reported reduced deployment time by 90%, is combined with job scheduler integration to detect and flag performance-degrading anomalies, such as a single job being inefficiently spread across multiple scalable units.
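
Conceptually, the cabling check compares an intended plan against observed neighbors (for example from LLDP) and marks a port green only on an exact match. The sketch below is a simplified stand-in for what Nexus Dashboard automates, with invented device and port names.

```python
# Simplified sketch of an automated cabling check: a port is "green" only when
# the observed neighbor matches the generated rail-optimized plan exactly.

# Intended plan: (switch, port) -> (server, NIC). Names are illustrative.
PLAN = {
    ("leaf1", "Eth1/1"): ("gpu-node-1", "nic0"),
    ("leaf1", "Eth1/2"): ("gpu-node-2", "nic0"),
    ("leaf2", "Eth1/1"): ("gpu-node-1", "nic1"),
}

# Observed neighbors, e.g. gathered via LLDP. One cable is swapped on purpose.
OBSERVED = {
    ("leaf1", "Eth1/1"): ("gpu-node-1", "nic0"),
    ("leaf1", "Eth1/2"): ("gpu-node-1", "nic1"),   # miscabled
    ("leaf2", "Eth1/1"): ("gpu-node-1", "nic1"),
}

def check_cabling(plan: dict, observed: dict) -> None:
    for port, expected in plan.items():
        actual = observed.get(port)
        status = "GREEN" if actual == expected else "RED"
        print(f"{port}: expected {expected}, observed {actual} -> {status}")

if __name__ == "__main__":
    check_cabling(PLAN, OBSERVED)
```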


Introduction to Cisco AI Cluster Networking Design with Paresh Gupta

Event: Networking Field Day 39

Appearance: Cisco Presents at Networking Field Day 39

Company: Cisco

Video Links:

Personnel: Paresh Gupta

Paresh Gupta, a principal engineer at Cisco focusing on AI infrastructure, began by outlining the diverse landscape of AI adoption, which spans from hyperscalers with hundreds of thousands of GPUs to enterprises just starting with a few hundred. He categorized these environments by scale—scale-up within a server, scale-out across servers, and scale-across between data centers—and by use case, such as foundational model training versus fine-tuning or inferencing. Gupta emphasized that the solutions for these different segments must vary, as the massive R&D budgets and custom software of a hyperscaler are not available to an enterprise, which needs a simpler, more turnkey solution.

Gupta then deconstructed the modern AI cluster, starting with the immense computational power of GPU servers, which can now generate 6.4 terabits of line-rate traffic per server. He detailed the multiple, distinct networks required, highlighting a recent shift in best practices: the front-end network and the storage network are now often converged. This change is driven by cost savings and the realization that front-end traffic is typically low, making it practical to share the high-bandwidth 400-gig fabric. This converged network is distinct from the inter-GPU backend network, which is dedicated solely to GPU-to-GPU communication for distributed jobs, as well as a separate management network and potentially a backend storage network for specific high-performance storage platforms.

Finally, Gupta presented a simplified, end-to-end traffic flow to illustrate the complete operational picture. A user request does not just hit a GPU; it first traverses a standard data center fabric, interacts with applications and centralized services like identity and billing, and only then reaches the AI cluster’s front-end network. From there, the GPU node may access high-performance storage, standard storage for logs, or organization-wide data. If the job is distributed, it lights up the inter-GPU backend network. This complete flow, he explained, is crucial for understanding that solving AI networking challenges requires innovations at every point of entry and exit, not just in the inter-GPU backend.


The Age of Operations: Third Party Views of DCF Innovation

Event: Networking Field Day 39

Appearance: Nokia Presents at Networking Field Day 39

Company: Nokia

Video Links:

Personnel: Scott Robohn

Scott Robohn of the consulting firm Solutional provided a third-party, operational perspective on data center innovation, based on a collaboration with The Futurum Group and Nokia. He introduced his background as a former network engineer and Tech Field Day delegate, now focused on NetOps and AI adoption. Robohn’s central thesis is that with AI driving new infrastructure builds and hardware becoming normalized, the next major performance gains in data centers will come from a relentless focus on improving operations. The joint project between Solutional, Futurum, and Nokia aimed to validate this by looking at AI’s dual role: both as a driver for new network infrastructure and as a set of tools to be used for network operations.

Robohn detailed the market trends shaping this new age of operations, starting with AI as a durable technology driving massive fabric build-outs by hyperscalers and a new class of NeoCloud providers. He defined NeoClouds as specialized providers focused on renting expensive, complex GPU-interconnect infrastructure for workloads like model training and video rendering. He then argued that as data center hardware has normalized around merchant silicon and stable architectures like Spine-Leaf, the hardware itself is no longer the key differentiator. This consistency makes automation more achievable and shifts the entire industry’s focus to operations as the critical area for innovation and reliability.

To validate this operational focus, the project stood on four legs. First, a Futurum reliability survey which found that network reliability is the number one purchasing criterion for data center infrastructure, far outweighing cost, and that human error remains a major issue. Second, a collaboration with Bell Labs on a reliability model, which concluded the most significant reduction in downtime comes from fixing operations-related issues. Third, interviews with Nokia’s own 80,000-employee internal enterprise IT team, who are successfully migrating their complex, high-stakes manufacturing and office networks to Nokia’s SR Linux and EDA platform. Finally, general market engagement confirming a palpable, industry-wide appetite for a new class of automation and AIOps tools.


Root Cause Analysis with Nokia AI Operations Automation

Event: Networking Field Day 39

Appearance: Nokia Presents at Networking Field Day 39

Company: Nokia

Video Links:

Personnel: Clayton Wagar

Clayton Wagar introduced Nokia’s AI-driven approach to root cause analysis, focusing on solving difficult day-two operational challenges. The presentation highlighted the chronic pain of hidden impairments or gray failures, where traditional monitoring systems fail because physical links appear active while protocols or services are down. The goal of Nokia’s deep RCA tool is to move beyond simple port-up/port-down alarming by correlating end-to-end application connectivity (from VM to VM) with all layers of the network, including the underlay, overlay, and control plane, to dramatically compress troubleshooting time.

A live demonstration was shown on the EDA SaaS platform using a real hardware Spine-Leaf network. The team introduced a gray failure by impairing a fiber link in a way that kept the interface status “up” but caused the BFD and BGP protocols to fail. Wagar explained that the AI’s multi-agent workflow correctly diagnosed this. Instead of using one large, monolithic model, a planning model first determines which tool-calling agents to deploy. These agents gather specific, relevant data from logs, topology, and configuration, which is then filtered and passed to a reasoning model. This agentic-based curation of data is Nokia’s key to reducing costs and avoiding the AI hallucinations that would otherwise be a risk in mission-critical networks.
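
The plan, gather, filter, and reason stages described above can be skeletonized as follows. The agents, data, and conclusion are invented for illustration and do not reflect Nokia’s internal design.

```python
# Skeleton of a multi-agent root-cause-analysis flow: plan -> gather -> filter -> reason.
# Agent names, data, and conclusions are illustrative, not Nokia's implementation.

def log_agent(ticket: str) -> dict:
    return {"logs": ["BFD session down on leaf1 Eth1/3", "BGP peer 10.0.0.2 idle"]}

def topology_agent(ticket: str) -> dict:
    return {"topology": "leaf1 Eth1/3 <-> spine2 Eth1/7 (interface oper-up)"}

def config_agent(ticket: str) -> dict:
    return {"config": "no recent changes on spine2"}

AGENTS = {"logs": log_agent, "topology": topology_agent, "config": config_agent}

def plan(ticket: str) -> list[str]:
    """Planning step: decide which tool-calling agents are relevant to the symptom."""
    return ["logs", "topology", "config"]

def filter_evidence(raw: dict) -> dict:
    """Curate the data: keep only findings that mention the suspect device."""
    return {k: v for k, v in raw.items() if "leaf1" in str(v)}

def reason(evidence: dict) -> dict:
    """Stand-in for the reasoning model summarizing curated evidence."""
    return {
        "summary": "Interface reports up, but BFD/BGP failed: likely impaired fiber (gray failure).",
        "confidence": 0.8,
        "evidence": evidence,
    }

def run_rca(ticket: str = "VM-to-VM connectivity loss") -> dict:
    raw = {}
    for name in plan(ticket):
        raw.update(AGENTS[name](ticket))
    return reason(filter_evidence(raw))

if __name__ == "__main__":
    print(run_rca())
```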

The tool’s capabilities were further demonstrated by successfully identifying a classic, hard-to-find MTU mismatch. Another key feature highlighted was the “Time Machine,” which allows an operator to select a past timeframe, such as thirty minutes prior to an event, and run the same AI-driven root cause analysis on the historical data from that moment. The entire process concludes with the AI generating a comprehensive report that provides a human-readable summary, a confidence score, and the specific evidence gathered by the agents, effectively solving a complex logic puzzle that would have taken an engineer hours or days to manually diagnose.


Delivering Nokia Enhanced AIOps with the Right Foundations

Event: Networking Field Day 39

Appearance: Nokia Presents at Networking Field Day 39

Company: Nokia

Video Links:

Personnel: Bruce Wallis

Bruce Wallis pivoted the discussion to Nokia’s AIOps capabilities, centered on a new natural language interface called Ask EDA. This feature, which resembles a ChatGPT for the network, allows operators to interact with the EDA platform through a simple chat box. The core idea is to abstract the complexity of network operations, enabling users to ask plain-English questions, such as “list my SRL interfaces” or “is BFD enabled?”, and receive back live, structured data, tables, and even on-the-fly visualizations like pie charts and line graphs. This approach removes the need for operators to understand the complex underlying Yang models or schemas for each vendor, as the AI handles the translation from human language to machine query.

The right foundation for this capability, as Wallis explained, is not a single, monolithic, trained model, but a flexible agentic AI framework. In this model, the central LLM acts as a brain that coordinates a set of pluggable agents, or tools, each with a specific function. The most powerful aspect of this design is its real-time extensibility. Wallis demonstrated this by first showing the AI failing to understand a request to “enable the locator LED.” He then installed a new support application from EDA’s App Store; when asked again, the AI agent immediately recognized and used this new tool to successfully execute the command. This app-based approach allows Nokia to add new troubleshooting workflows and capabilities on the fly, without retraining the model or upgrading the core platform.
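
The behavior Wallis demonstrated maps naturally onto a tool registry: the assistant can only invoke tools that are currently registered, and installing an app simply registers another tool. The sketch below illustrates that general pattern; it is not the EDA App Store mechanism itself, and the tool names are invented.

```python
# Generic sketch of an agentic tool registry: the assistant can only use tools
# that are registered, and new capabilities appear by registering new tools.
TOOLS = {}

def register(name: str):
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register("list interfaces")
def list_interfaces(device: str) -> str:
    return f"{device}: Eth1/1 up, Eth1/2 up, Eth1/3 down"   # canned data for the sketch

def ask(request: str, device: str) -> str:
    """Stand-in for the LLM choosing a tool by matching the request to a tool name."""
    for name, fn in TOOLS.items():
        if name in request.lower():
            return fn(device)
    return "I don't have a tool for that yet."

if __name__ == "__main__":
    print(ask("Please list interfaces", "leaf1"))
    print(ask("Enable locator led", "leaf1"))        # fails: no such tool yet

    @register("enable locator led")                   # "installing the app" adds the tool
    def enable_locator_led(device: str) -> str:
        return f"{device}: locator LED enabled"

    print(ask("Enable locator led", "leaf1"))         # now the new tool is used
```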

This agentic framework is applied directly to troubleshooting and operations. Wallis showed how “Ask EDA” can be used to investigate “deviations,” or configuration drift, where the running config no longer matches the intended state. In another example, with a BGP peer alarm active, the AI was asked to investigate. It used its agents to query various resources, analyze the topology, and correctly identified that the BGP manager process had crashed and restarted, providing a direct link to the deviation. Wallis emphasized that this method of using the LLM to query factual data from live telemetry and tools is how Nokia is addressing the problem of hallucinations, ensuring the AI’s answers are grounded in reality.


Nokia Event-Driven Automation (EDA) Multi Vendor – Deliver on the Promise

Event: Networking Field Day 39

Appearance: Nokia Presents at Networking Field Day 39

Company: Nokia

Video Links:

Personnel: Bruce Wallis

Bruce Wallis, Product Manager for Nokia’s EDA (Event-Driven Automation) platform, began by revisiting the core assertions made when the product was unveiled a year prior. He reiterated that EDA was built on the same successful principles as Kubernetes: abstraction and a declarative model. The goal was to apply this logic to networking, creating a ubiquitous platform that could define a unit of work for the network, just as Kubernetes did for compute workloads. This approach aims to normalize networking primitives like interfaces and BGP peers, allowing an operator to declare the desired end state without scripting the specific sequential steps, letting the platform handle the how.

The presentation’s main focus was delivering on the multi-vendor promise made at the previous event. Wallis conducted a live demo, bootstrapping an eight-node, dual-stack fabric underlay using a single 58-line YAML file. This high-level abstract definition was automatically reconciled by EDA, which then generated and pushed the correct, vendor-specific configurations to four different operating systems running on the leaf switches: Nokia SR Linux, Nokia SROS, Cisco Nexus, and Arista EOS. This demonstrated the platform’s ability to manage a heterogeneous network through one common model.
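
The reconciliation idea, one abstract declaration rendered into several vendor-specific configurations, can be sketched as a small translation step. The spec fields and device syntax below are simplified approximations rather than EDA’s actual schema.

```python
# Tiny sketch of declarative reconciliation: one abstract fabric intent rendered
# into per-vendor interface configuration. Spec and syntax are simplified.
FABRIC_SPEC = {
    "name": "demo-fabric",
    "links": [
        {"leaf": "leaf1", "port": "1", "os": "srlinux"},
        {"leaf": "leaf2", "port": "1", "os": "nxos"},
        {"leaf": "leaf3", "port": "1", "os": "eos"},
    ],
}

RENDERERS = {
    "srlinux": lambda l: f'set / interface ethernet-1/{l["port"]} admin-state enable',
    "nxos":    lambda l: f'interface Ethernet1/{l["port"]}\n  no shutdown',
    "eos":     lambda l: f'interface Ethernet{l["port"]}\n   no shutdown',
}

def reconcile(spec: dict) -> None:
    """Render the abstract intent into whatever syntax each device expects."""
    for link in spec["links"]:
        config = RENDERERS[link["os"]](link)
        print(f'--- {link["leaf"]} ({link["os"]}) ---\n{config}')

if __name__ == "__main__":
    reconcile(FABRIC_SPEC)
```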

Finally, Wallis addressed other key platform features, including the ability to bubble operational state up into the abstract model, allowing operators to view the “health” of the entire fabric rather than just individual components. He also clarified for delegates that while he used YAML for demo speed, the platform is fully operable via a form-based UI for users unfamiliar with programmatic inputs. He concluded the demo by successfully deploying a complex EVPN overlay network across the newly built, multi-vendor underlay, again using a single, simple declarative input.


Exploring the Power and Potential of Enhanced AIOps with Nokia

Event: Networking Field Day 39

Appearance: Nokia Presents at Networking Field Day 39

Company: Nokia

Video Links:

Personnel: Clayton Wagar

Clayton Wagar, leading the AI practice for Nokia’s IP division, framed his presentation as creating a “through line” connecting the history of network operations to its AI-driven future. Recalling his own start managing a 911 data center with manual, CLI-driven processes, he traced the evolution of the industry as operators moved from mastering workflows in their heads to writing them down and eventually using automation. Wagar emphasized that as AI is introduced, it’s crucial to be prescriptive about its use and to understand its two main facets: the “plumbing” aspect of building massive networks to support AI, and the application of AI tools to network operations itself.

To illustrate the challenge of applying AI to mission-critical systems, Wagar told a story about a 1943 discussion at Bell Labs between Claude Shannon and Alan Turing. They faced a choice: build a computer like a human brain (a neural network) or like an adding machine (a deterministic system). They chose the adding machine, not only because the technology for neural networks didn’t exist, but critically because telcos and governments required predictable, deterministic outputs, not a system that might hallucinate. This historical context highlights the primary challenge Nokia addresses today: reducing AI hallucinations to make the technology safe for essential, real-world networks where reliability is paramount.

Wagar then connected this to the modern concept of autonomous networks, such as the levels defined by TM Forum. He proposed that these frameworks were largely developed before modern AI and assumed a deterministic path, whereas AI introduces a new, separate plane of capability. He pointed to Google’s public journey toward autonomous networking, which leverages custom-built AI agents to move beyond simple event-driven workflows to a truly autonomous SRE model. Wagar concluded by positioning Nokia’s strategy as learning from these leaders, blending AI and traditional automation to inform their product development and establish new best practices for the industry.


Introduction to Nokia with Andy Lapteff

Event: Networking Field Day 39

Appearance: Nokia Presents at Networking Field Day 39

Company: Nokia

Video Links:

Personnel: Andy Lapteff

Andy Lapteff, a network engineer who became a product marketing manager for Nokia Data Center, introduced his presentation by sharing his personal journey. He admitted that, like many, he previously only knew Nokia for its indestructible cell phones and was unaware of its significant presence in the data center market. It was at a previous Tech Field Day event, attending as a delegate, that he learned Nokia had been building mission-critical networking infrastructure for decades for entities like air traffic control and power grids, and was now applying those same principles of ultra-reliability to the data center. This revelation was so compelling that it motivated him to join the company.

Lapteff contrasted Nokia’s approach with his own past experiences working midnights in a network operations center (NOC), where networks were complex, fragile, and “on fire all the time,” and a “no change Fridays” mentality was common. His motivation at Nokia is driven by the memory of those challenges, aiming to build more robust and reliable networks for data center operators. He expressed genuine excitement for the value Nokia provides, which he feels is real and tangible, unlike other vendor jobs he has held.

The presentation set the stage for a deeper dive into the future of networking, particularly focusing on the often-confusing landscape of AI, AIOps, and the different levels of network autonomy. Lapteff noted that while the industry is largely at a manual or low-automation level, Nokia is pushing toward full autonomous operations. He concluded by previewing the day’s key topics, promising impressive demonstrations on “Agentic AI ops in action,” AI-driven root cause analysis, the automation of AI back-end fabrics, and a much-anticipated update on Nokia’s potential support for non-Nokia hardware.


The Resource Costs of AI

Event: Networking Field Day 39

Appearance: Networking Field Day 39 Delegate Roundtable Discussion

Company: Tech Field Day

Video Links:

Personnel: Tom Hollingsworth

The Networking Field Day 39 roundtable, led by Tom Hollingsworth, dove straight into the massive resource drain caused by the current AI boom. The delegates discussed how the industry is already behind the eight ball on power, with AI’s exponential demand making the existing, outdated power grid’s problems significantly worse. This isn’t just a data center issue. It’s causing widespread component shortages for essentials like RAM and GPUs, affecting everyone from enterprise users to home gamers. The conversation highlighted that consumers are ultimately paying for this AI race, not just through new tech, but through soaring power and water bills as AI companies with deep pockets outbid ordinary consumers for finite resources.

In response to this power crisis, the discussion shifted to solutions. The delegates noted a serious new investigation into small modular nuclear reactors and even restarting old plants like Three Mile Island, things unthinkable just a few years ago, alongside more hopeful developments in solar. On the networking side, this resource demand is forcing the creation of entirely new, expensive technologies like Ultra Ethernet and massive 800-gig switches just to keep these AI data centers fed. These come with huge R&D costs, which will inevitably be passed down, raising the question of whether networking costs are about to go through the roof for reasons outside the network engineer’s control.

Finally, the panel debated the future of AI itself, noting the buzzword is becoming meaningless as the industry pivots from massive LLMs to more efficient, domain-specific smaller models (SLMs) and agentic AI. The group observed that AI might move from giant cloud data centers to running locally on devices with new NPUs, or even on standard laptops. Ultimately, the delegates concluded that the future of AI won’t just be decided by the hyperscalers. It will be shaped by consumers and engineers through the products they choose to use and the companies they criticize for wasting resources.