In this presentation, Andy Lapteff, a network engineer turned product marketing manager at Nokia, introduces the company’s shift from its iconic mobile phone legacy to its current role as a leader in mission-critical network infrastructure. He highlights Nokia’s historical reputation for reliability, exemplified by the legendary indestructible 3310 phone, and explains how that same commitment to durability now applies to complex systems like train signaling, power grids, and air traffic control. Lapteff acknowledges the industry-wide fatigue regarding AI hype but emphasizes that Nokia’s objective at Networking Field Day 40 is to move past the noise and provide clear signal on how networking is fundamentally changing to accommodate the unique demands of the AI era.
Lapteff shares his personal evolution from a skeptic of AI networking to a believer, citing the unprecedented rate of adoption and the staggering capital investment in the sector. He notes that while the Apollo space program cost the equivalent of $65 billion in today’s dollars, current global investment in AI infrastructure is reaching approximately $690 billion this year alone. To illustrate the tangible impact of these technologies, he describes how generative AI has transformed him into a software developer capable of automating complex, multi-step workflows that previously took hours. These tools are no longer just fancy Google searches but are functional drivers of productivity that require a completely reimagined underlying network.
The summary concludes by addressing the technical shift required to support these modern workloads, asserting that traditional protocols like TCP are no longer sufficient because the data center has essentially become one massive, interconnected computer. Lapteff argues that if the network is not as reliable as the indestructible hardware Nokia was once famous for, the entire expensive AI system will fail. The presentation sets the stage for a series of technical deep dives covering new networking models, optimized designs, the evolution of Ethernet and transport protocols, and the operational platforms necessary to keep these high-stakes environments running efficiently.
Personnel: Andy Lapteff
The rise of AI has driven the emergence of multiple new network domains, each with distinct roles, architectures, and performance requirements. This presentation explores these new networks and their roles. Patrick McCabe, representing Nokia, builds on the premise that AI is a permanent fixture in the technological landscape, requiring a non-linear evolution of network architecture. He identifies two primary functions, training and inferencing, as the drivers of this change. Training involves massive GPU clusters that scale geometrically and are highly sensitive to packet loss, while inferencing, often pushed to the network edge, prioritizes low latency to serve end users effectively. Together, these functions demand a move away from traditional statistical averages toward a more deterministic approach to network performance.
Architecturally, the shift from north-south to massive east-west traffic patterns within GPU clusters has rendered traditional leaf-spine designs inadequate for AI data movement. McCabe details the emergence of specialized backend networks categorized as scale-up, scale-out, and scale-across. Scale-up handles communication within a single system or server, while scale-out facilitates high-speed interaction between different systems within a data center, a primary focus for the Ultra Ethernet Consortium (UEC). Scale-across is a particularly challenging new frontier, necessitated by the fragmentation of AI clusters across different physical locations, often due to power constraints, requiring advanced routing and data center interconnects to maintain the illusion of a single compute entity over distances of 10 kilometers or more.
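To make the scale-across challenge concrete, a back-of-the-envelope calculation (not from the presentation) shows what 10 kilometers of fiber costs in latency alone, assuming light propagates at roughly two-thirds of c in single-mode fiber:

```python
# Propagation delay for a scale-across link. Assumes ~2e8 m/s in
# single-mode fiber (refractive index ~1.5); the 10 km figure comes
# from the presentation, the other distances are illustrative.

C_FIBER_M_PER_S = 2.0e8  # approximate speed of light in fiber

def fiber_rtt_us(distance_km: float) -> float:
    """Round-trip propagation delay in microseconds over fiber."""
    one_way_s = (distance_km * 1_000) / C_FIBER_M_PER_S
    return 2 * one_way_s * 1e6

for km in (0.1, 1, 10, 100):
    print(f"{km:>6} km -> {fiber_rtt_us(km):8.1f} us RTT")
```

At 10 km, physics alone adds about 100 microseconds of round-trip delay before any queuing or serialization, which is why scale-across designs need deep buffers and congestion control tuned for a much larger bandwidth-delay product.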
The presentation emphasizes that the center of this new universe is the GPU, supported by essential storage networks that feed vast amounts of data to processing units. While the back end deals with the rigors of scale and reliability, the front end remains more traditional, connecting these specialized environments to the outside world and end users. McCabe concludes with an analogy comparing AI to the printing press, suggesting that while AI lowers the cost and scarcity of production, it does not replace the human creator. Instead, it shifts the premium value toward innovation, ideas, and judgment, allowing for a radical expansion of who can create within this high-performance infrastructure.
Personnel: Patrick McCabe
Get an inside look at how Nokia Validated Designs (NVDs) streamline AI-ready data center and networking deployments through proven architectures, rigorous validation, and real-world performance insights. We’ll highlight several of our latest AI-focused NVDs, show how partners are extending them, and preview what’s coming next as we evolve the portfolio to meet the demands of modern, high-performance networks. Vivek Venugopal explains that Nokia treats network construction with the same intolerance for failure as aeronautical engineering, ensuring that every NVD is pre-tested on physical hardware to guarantee reliability. These designs are developed through an iterative workflow that begins with industry ideation and digital twin modeling in containerlab, followed by extensive hardware validation of optics, cables, and protocols. Unlike rigid templates, NVDs serve as documented tech stacks that customers can customize, backed by a four-year support lifecycle that treats the design itself as a managed product.
The presentation highlights several AI-specific architectures, including a rail-only design developed with Lenovo and AMD for small-to-medium clusters and a more complex two-stripe pod design for larger environments using NVIDIA H200 or AMD GPUs. A key innovation discussed is the use of VRFs to emulate multiple leaf switches, allowing customers to validate GPU clusters at their target scale without over-provisioning hardware. To ensure these networks are truly lossless, Nokia rigorously validates the interaction between Explicit Congestion Notification (ECN) and Priority Flow Control (PFC). The goal is to ensure ECN triggers first to slow down traffic before PFC pauses frames, preventing the catastrophic tail drops that would force an AI training model to restart from a previous checkpoint.
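The ECN-before-PFC ordering can be pictured with a minimal queue model. The sketch below is illustrative only, with hypothetical thresholds rather than Nokia’s validated values or actual SR Linux configuration:

```python
# Minimal queue model of why the ECN marking threshold must sit below
# the PFC XOFF threshold: ECN slows senders end-to-end before PFC
# resorts to pausing the link. Thresholds are hypothetical, and the
# model ignores senders actually rate-limiting in response to marks.

ECN_MIN_CELLS = 400    # start marking CE above this queue depth
PFC_XOFF_CELLS = 800   # send a pause frame above this depth

def enqueue(depth: int, arriving: int) -> tuple[int, list[str]]:
    """Return the new queue depth and the congestion actions it triggers."""
    depth += arriving
    actions = []
    if depth >= ECN_MIN_CELLS:
        actions.append("ECN: mark packets, senders rate down")
    if depth >= PFC_XOFF_CELLS:
        actions.append("PFC: pause upstream (last resort)")
    return depth, actions

depth = 0
for burst in (300, 200, 200, 300):   # an incast ramping the queue up
    depth, actions = enqueue(depth, burst)
    print(f"depth={depth:4} cells  {actions or ['no action']}")
```

Because ECN fires hundreds of cells before PFC, senders get a chance to back off before any pause frames ripple upstream.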
To prove the real-world efficacy of these designs, Nokia goes beyond simple network specifications to perform application-level benchmarking using open-source tools like Llama 2 and BERT. By measuring job completion times and tokens per second against MLCommons standards, they provide a full-stack validation that includes the GPU servers, storage fabrics (using partners like VAST Data or DDN), and the backend network fabric. The NVD roadmap continues to expand with upcoming designs for scale-across architectures, multi-plane fabrics, and storage-focused deployments. All automation playbooks, telemetry stacks, and digital twin models are made available on GitHub, allowing engineers to try before they buy and ensuring the designs remain accessible and open for integration with common frameworks like Ansible and Netbox.
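The two headline metrics, job completion time and tokens per second, reduce to simple wall-clock arithmetic. The toy calculation below uses made-up run data rather than the MLCommons-style harnesses Nokia actually runs:

```python
# Toy computation of job completion time (JCT) and tokens per second.
# Run data is invented for illustration; real NVD benchmarking measures
# these with MLCommons-standard workloads such as Llama 2 and BERT.

from dataclasses import dataclass

@dataclass
class TrainingRun:
    tokens_processed: int   # total tokens seen during the run
    start_s: float          # wall-clock start (seconds)
    end_s: float            # wall-clock end (seconds)

    @property
    def jct_s(self) -> float:
        return self.end_s - self.start_s

    @property
    def tokens_per_s(self) -> float:
        return self.tokens_processed / self.jct_s

baseline = TrainingRun(tokens_processed=50_000_000, start_s=0.0, end_s=1_250.0)
tuned    = TrainingRun(tokens_processed=50_000_000, start_s=0.0, end_s=1_100.0)

speedup = baseline.jct_s / tuned.jct_s
print(f"baseline: {baseline.tokens_per_s:,.0f} tok/s, JCT {baseline.jct_s}s")
print(f"tuned:    {tuned.tokens_per_s:,.0f} tok/s, JCT {tuned.jct_s}s "
      f"({speedup:.2f}x faster)")
```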
Personnel: Vivek Venugopal
Ethernet continues to evolve to meet the performance and scaling demands of modern AI networking architectures, progressing from RoCEv2 toward innovations driven by the Ultra Ethernet Consortium (UEC). This presentation discusses these requirements and introduces UEC Specification 1.0, with a focus on scale-out AI designs and the core philosophies shaping its development. Key Ethernet capabilities defined in UEC 1.0, both already implemented and forthcoming, are highlighted to show how Ethernet is being optimized for large-scale AI workloads. Alfred Nothaft explains that the primary challenge in AI fabrics is congestion management, particularly during the synchronization phases of training where thousands of GPUs simultaneously attempt to share massive amounts of gradient data. While legacy tools like ECN and PFC provide basic notification and pause mechanisms, they are often insufficient for the high-velocity requirements of current AI clusters.
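The scale of those synchronization bursts is easy to underestimate. The rough estimate below assumes a plain data-parallel ring all-reduce over fp16 gradients, with illustrative model and cluster sizes not taken from the presentation (real jobs use hybrid parallelism that reduces the per-step volume):

```python
# Rough estimate of the traffic burst behind a gradient-synchronization
# step, assuming a ring all-reduce. All figures below are illustrative.

PARAMS = 70_000_000_000   # a 70B-parameter model
N_GPUS = 1024             # GPUs in the job
BYTES_PER_PARAM = 2       # fp16 gradients

def ring_allreduce_bytes_per_gpu(params: int, bytes_per_param: int,
                                 n_gpus: int) -> float:
    """Bytes each GPU transmits in one all-reduce: 2*(N-1)/N * gradient size."""
    return 2 * (n_gpus - 1) / n_gpus * params * bytes_per_param

per_gpu = ring_allreduce_bytes_per_gpu(PARAMS, BYTES_PER_PARAM, N_GPUS)
print(f"~{per_gpu / 1e9:.0f} GB sent per GPU per synchronization step")
print(f"~{per_gpu * N_GPUS / 1e12:.0f} TB crossing the fabric at once")
```

Hundreds of gigabytes per GPU, launched simultaneously by every endpoint, is exactly the pattern that overwhelms notification-and-pause mechanisms designed for ordinary traffic mixes.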
The move toward UEC 1.0 represents a fundamental shift from network-centric congestion control to an end-node-centric philosophy. Under the RoCEv2 model, the network infrastructure is largely responsible for managing traffic flows and reacting to congestion. In contrast, UEC shifts the intelligence to the Network Interface Card (NIC) at the GPU endpoint. This allows for more granular, per-packet load balancing rather than traditional flow-based hashing, enabling the NIC to “spray” traffic across multiple paths and dynamically adjust based on real-time telemetry. Furthermore, the UEC transport (UET) is designed to be connectionless and includes native, hardware-level security and encryption from the outset, addressing data sovereignty and privacy concerns that were previously overlooked in backend fabrics.
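The difference between flow-based hashing and per-packet spraying can be sketched in a few lines. This is a toy contrast only: real NICs hash in silicon and weight paths by live telemetry rather than simple round-robin, and UET handles the resulting reordering at the receiving endpoint:

```python
# Contrast between flow-based ECMP hashing and per-packet spraying.
# Illustrative only; path names and the flow tuple are hypothetical.

import hashlib
from itertools import cycle

PATHS = ["uplink-1", "uplink-2", "uplink-3", "uplink-4"]

def ecmp_path(flow_5tuple: tuple) -> str:
    """Classic ECMP: every packet of a flow hashes to the same uplink."""
    digest = hashlib.sha256(repr(flow_5tuple).encode()).digest()
    return PATHS[digest[0] % len(PATHS)]

flow = ("10.0.0.1", "10.0.0.2", 49152, 4791, "UDP")
print("ECMP  :", [ecmp_path(flow) for _ in range(6)])   # same path 6 times

spray = cycle(PATHS)   # per-packet spraying: successive packets fan out
print("Spray :", [next(spray) for _ in range(6)])       # all paths used
```

With hashing, one elephant flow can pin an entire uplink while its siblings sit idle; spraying keeps all paths uniformly loaded at the cost of out-of-order delivery, which the UET NIC is designed to absorb.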
UEC 1.0 introduces several sophisticated mechanisms to ensure job completion times are minimized. These include packet trimming, which reduces a packet to its header during congestion to signal the source without losing the stream’s context, and advanced in-band telemetry for precise congestion signaling. The specification also features link-layer retransmission to quickly recover from localized bit errors and credit-based flow control to meter traffic before it ever saturates the fabric. By leveraging Ethernet’s vast ecosystem and rapid bandwidth scaling, doubling speeds every two years toward 1.6 terabits, Nokia and the UEC aim to provide a highly flexible, vendor-neutral alternative to proprietary interconnects, supporting everything from local scale-out clusters to geodistributed scale-across environments.
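Packet trimming is the easiest of these mechanisms to illustrate. The sketch below captures the idea only; field names and sizes are hypothetical, and the actual encapsulation is defined by the UEC specification:

```python
# Sketch of the packet-trimming idea: under congestion the switch
# forwards just the header instead of dropping silently, so the receiver
# can tell the sender exactly which packet to resend.

from dataclasses import dataclass

HEADER_BYTES = 64   # hypothetical header size

@dataclass
class Packet:
    seq: int
    payload: bytes = b""
    trimmed: bool = False

    @property
    def size(self) -> int:
        return HEADER_BYTES + len(self.payload)

def forward(pkt: Packet, queue_free_bytes: int) -> Packet | None:
    """Enqueue the full packet if it fits; otherwise trim it to its header."""
    if pkt.size <= queue_free_bytes:
        return pkt                                 # normal forwarding
    if HEADER_BYTES <= queue_free_bytes:
        return Packet(seq=pkt.seq, trimmed=True)   # congestion: trim, don't drop
    return None                                    # truly out of space: drop

pkt = Packet(seq=42, payload=b"x" * 4096)
out = forward(pkt, queue_free_bytes=512)
print(out)   # trimmed header arrives -> receiver requests a resend of seq 42
```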
Personnel: Alfred Nothaft
Explore the essential management considerations for building and operating multi-tenant AI data center networks. Attendees will learn why abstraction is critical to achieving the scale, speed, and consistency required for AI infrastructure. The presentation will demonstrate how event-driven automation (EDA) simplifies the design, deployment, and operation of backend AI networks, enabling secure and efficient multi-tenancy at scale. Zeno Dhaene, Product Manager for Nokia’s Event-Driven Automation (EDA), emphasizes that while AI data centers may appear uniform, they are uniquely defined by specific physical locations and business needs, such as GPU-as-a-service or shared internal infrastructure. To manage this complexity, EDA utilizes declarative intent and multiple layers of abstraction, allowing operators to treat an entire data center, comprising hundreds of thousands of configuration lines, as a single, manageable resource.
During the demonstration, Dhaene builds a functional AI backend featuring two stripes and a spine connector, capable of hosting approximately 2,000 GPUs, using only high-level labels rather than manual interface assignments. By tagging nodes and interfaces with metadata like “role,” “tenant,” and “data center,” EDA automatically orchestrates the underlying technical requirements, including BGP peering, IP address pooling, and the creation of isolated virtual routers for different tenants. This process is validated through a dry run feature that checks generated configurations against switch YANG models, ensuring accuracy before deployment. The platform’s ability to emit over 2,000 output resources from a single input resource illustrates the efficiency of moving away from traditional, manual configuration methods toward highly automated, intent-based systems.
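The one-input-to-thousands-of-outputs fan-out can be mimicked in a few lines of Python. This is a toy model of the idea only, not EDA’s actual resource model or API:

```python
# Toy version of label-driven intent expansion: one declarative intent,
# selected by metadata labels, fans out into many per-node resources.
# All names, labels, and the ASN pool below are hypothetical.

fabric_intent = {
    "name": "ai-backend",
    "selector": {"role": "leaf", "data_center": "dc1"},
    "tenant": "gpu-as-a-service",
    "underlay": "ebgp",
}

inventory = [
    {"node": f"leaf-{i}", "role": "leaf", "data_center": "dc1"}
    for i in range(1, 33)
] + [
    {"node": f"spine-{i}", "role": "spine", "data_center": "dc1"}
    for i in range(1, 5)
]

def matches(node: dict, selector: dict) -> bool:
    return all(node.get(k) == v for k, v in selector.items())

outputs = [
    {
        "node": node["node"],
        "bgp_asn": 64512 + idx,                   # allocated from a pool
        "vrf": f"{fabric_intent['tenant']}-vrf",  # tenant isolation
    }
    for idx, node in enumerate(
        n for n in inventory if matches(n, fabric_intent["selector"])
    )
]

print(f"1 intent -> {len(outputs)} node configs; first: {outputs[0]}")
```

Scale the selector up from 32 leaves to a full data center and the same single intent yields the thousands of generated resources Dhaene demonstrates.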
The presentation also highlights EDA’s flexibility and integration capabilities, noting that the platform can serve as a standalone orchestrator or a gateway that pulls data from external sources of truth like Netbox or Nautobot. This allows for zero-touch automation where internal business tools can trigger network changes, such as provisioning GPUs for a customer, without direct operator intervention. Dhaene concludes by showcasing EDA’s real-time telemetry streaming and its digital twin capabilities, which allow engineers to simulate and test entire data center fabrics on a laptop. By providing a scalable framework for both cookie-cutter and highly customized environments, Nokia’s EDA addresses the critical need for speed and reliability in the rapidly evolving AI infrastructure market.
Personnel: Zeno Dhaene
AI networks require purpose-built hardware platforms designed for different roles across the infrastructure. This presentation outlines the hardware platforms positioned for these roles, highlighting how each supports performance, bandwidth, and operational needs, with a focus on the scale-out part of the network. It also looks ahead to emerging platforms designed for scale-across architectures, enabling the next phase of large-scale, interconnected AI systems. Igor Giangrossi, lead of hardware product management at Nokia, details the specialized data center portfolio that moves beyond the traditional 7750 SR into platforms specifically optimized for the high-throughput, low-latency demands of AI training and inference.
The presentation focuses heavily on the 7220 IXR series, which utilizes Broadcom Tomahawk chipsets to drive the scale-out portion of the network. Giangrossi introduces the Tomahawk 5 (TH5) generation, offering up to 51.2 Tbps capacity with 800G ports, and the newer Tomahawk 6 (TH6) generation, which doubles density to 128 ports of 800G or provides 1.6T Ethernet capabilities. A notable advancement in the TH6 family is the introduction of liquid-cooled models designed for 21-inch OCP ORV3 racks, addressing the extreme power densities required as AI clusters scale. These platforms integrate advanced features like packet trimming and credit-based flow control into the packet pipeline to manage congestion and improve job completion times.
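The port counts follow directly from the chip capacities; the quick sanity check below simply divides the publicly quoted Tomahawk throughputs by the port speed:

```python
# Radix arithmetic: chip throughput divided by port speed gives the
# number of ports. Chip capacities are the publicly quoted figures.

def ports(chip_tbps: float, port_gbps: int) -> int:
    return int(chip_tbps * 1000 // port_gbps)

print("Tomahawk 5:", ports(51.2, 800), "x 800G")          # 64 ports
print("Tomahawk 6:", ports(102.4, 800), "x 800G, or",
      ports(102.4, 1600), "x 1.6T")                       # 128 or 64 ports
```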
For scale-across and deep-buffered routing roles, Nokia utilizes the Broadcom Jericho family, including the 7250 IXR-X4 pizza box and the massive IXR-e chassis series. These platforms provide the necessary buffering for geodistributed clusters and long-reach interconnects while maintaining high port density, such as 576 ports of 800G in a single IXR-18e chassis. The hardware design prioritizes operational efficiency and reliability through a midplane-less orthogonal architecture, honeycomb meshes for improved airflow, and the deliberate avoidance of retimers to reduce power consumption by up to 30%. This tiered approach ensures that the most appropriate silicon, whether Tomahawk, Jericho, or Nokia’s proprietary FP NPU, is deployed for each specific role in the AI infrastructure.
Personnel: Igor Giangrossi