|
Kedar Dhuru, Praful Lalchandani, Jeremy Wallace, Kyle Baxter, and Vikram Singh presented for Juniper Networks at AI Infrastructure Field Day 2
This presentation was recorded on April 23, 2025, from 08:00 to 11:30.
Presenters: Jeremy Wallace, Kedar Dhuru, Kyle Baxter, Praful Lalchandani, Vikram Singh
AI Unbound, Your Data Center Your Way with Juniper Networks
Watch on YouTube
Watch on Vimeo
Praful Lalchandani, VP of Product, Data Center Platforms and AI Solutions at Juniper Networks, opened the presentation by highlighting the rapid growth of the AI data center space and its unique challenges. He noted that Juniper Networks, with its 25 years of experience in networking and security, is uniquely positioned to address these challenges and help customers meet the demands of AI. Juniper is seeing strong momentum in the data center space, with data center networking revenues exceeding $1 billion in 2024. The presentation then dove into the increasing distribution of AI workloads across hyperscalers, neocloud providers, private clouds, and the edge, emphasizing Juniper’s comprehensive portfolio of solutions spanning data center fabrics, interconnectivity, and security.
Lalchandani focused on the critical role of networking in the AI lifecycle, particularly for training and inference. High-bandwidth, low-latency, congestion-free networking is essential to minimizing job completion time for training and to maximizing throughput and minimizing latency for inference. The discussion highlighted Juniper’s innovations in this space, including AI load balancing capabilities such as Dynamic Load Balancing, Global Load Balancing, and RDMA-aware load balancing. Juniper was the first vendor in the industry to ship a 64-port 800-gig switch, underscoring its commitment to providing the bandwidth AI workloads demand and helping it achieve a leading share of the 800-gig market.
The presentation also emphasized the operational complexity and security challenges of AI clusters. Juniper’s Apstra solution offers lifecycle management from design to deployment to assurance, providing end-to-end congestion visibility and automated remediation recommendations. Security is paramount, and Juniper advocates a defense-in-depth approach with its SRX portfolio: protection at the edge, east-west security within the fabric, encrypted data center interconnects, and security for public cloud applications. The presentation concluded by addressing the dilemma customers face between open, best-of-breed technologies and proprietary, tightly coupled ecosystems, noting that Juniper’s validated designs, developed in its AI labs, show customers they do not have to make that trade-off and can get the best of both worlds.
Personnel: Praful Lalchandani
Maximize AI Cluster Performance using Juniper Self-Optimizing Ethernet with Juniper Networks
Watch on YouTube
Watch on Vimeo
Vikram Singh, Sr. Product Manager, AI Data Center Solutions at Juniper Networks, discussed maximizing AI cluster performance using Juniper’s self-optimizing Ethernet fabric. As AI workloads scale, high GPU utilization and minimized congestion are critical to maximizing performance and ROI. Juniper’s advanced load balancing innovations deliver a self-optimizing Ethernet fabric that dynamically adapts to congestion and keeps AI clusters running at peak efficiency.
The presentation addressed the unique challenges posed by AI/ML traffic, which consists primarily of low-entropy, bursty, UDP-based flows, and by the synchronous nature of data-parallel training, in which GPUs must synchronize gradients after each iteration. This synchronization makes job completion time the key metric, since a delay in a single flow can leave many GPUs idle. Traditional Ethernet load balancing, designed around TCP’s in-order delivery requirements, does not handle this type of traffic efficiently, leading to congestion and performance degradation. Alternatives such as packet spraying with specialized NICs or distributed scheduled fabrics are expensive and proprietary.
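To see why a single slow flow matters, consider a hypothetical synchronization step in which every GPU must wait for the slowest gradient-exchange flow before the next iteration can begin. The sketch below uses illustrative numbers only, not figures from the presentation.

```python
# Illustrative sketch: synchronous data parallelism means iteration time
# is gated by the slowest gradient-exchange flow (all numbers hypothetical).

flow_times_ms = [10, 10, 11, 10, 10, 12, 10, 31]  # 8 GPUs; one flow hit congestion

compute_ms = 50                          # per-iteration compute time on each GPU
iteration_comm_ms = max(flow_times_ms)   # barrier: everyone waits for the slowest flow

for gpu, t in enumerate(flow_times_ms):
    idle_ms = iteration_comm_ms - t      # time this GPU spends waiting at the barrier
    print(f"GPU {gpu}: comm {t} ms, idle {idle_ms} ms")

utilization = compute_ms / (compute_ms + iteration_comm_ms)
print(f"Effective GPU utilization this iteration: {utilization:.0%}")
```

With one congested flow taking 31 ms instead of roughly 10 ms, every other GPU stalls at the barrier, which is why the presenters treat job completion time, rather than average link utilization, as the metric that matters.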
Juniper offers an open, standards-based Ethernet approach called AI load balancing. Dynamic load balancing (DLB) enhances static ECMP by tracking link utilization and buffer pressure at microsecond granularity to make better-informed forwarding decisions, and it operates in either flowlet mode (breaking flows into subflows separated by configurable pauses) or packet mode (packet spraying). Global Load Balancing (GLB) builds on DLB by exchanging link-quality data between leaves and spines, enabling leaves to make more informed decisions and avoid congested paths. Juniper’s RDMA-aware load balancing (RLB) uses deterministic routing, assigning IP addresses to subflows to eliminate randomness and deliver consistently high performance with in-order delivery, even on non-rail-optimized designs, without expensive hardware.
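The flowlet idea behind DLB can be sketched in a few lines of Python: when the gap between packets of the same flow exceeds a configurable inactivity interval, the switch can repin the flow to a less congested uplink without reordering packets already in flight. This is a conceptual model only; the quality metric, interval value, and data structures below are assumptions, not Juniper’s implementation.

```python
import time

# Conceptual model of flowlet-mode dynamic load balancing (not Juniper's implementation).
# Assumption: "quality" combines link utilization and buffer pressure; lower is better.

INACTIVITY_INTERVAL_S = 0.000016                   # hypothetical 16 us flowlet gap
link_quality = {0: 0.2, 1: 0.7, 2: 0.1, 3: 0.4}    # per-uplink congestion score
flow_state = {}                                    # flow_id -> (chosen_link, last_seen)

def pick_link(flow_id: int, now: float) -> int:
    """Return the uplink for this packet, re-evaluating only at flowlet boundaries."""
    link, last_seen = flow_state.get(flow_id, (None, 0.0))
    if link is None or (now - last_seen) > INACTIVITY_INTERVAL_S:
        # The gap is long enough that switching links cannot reorder in-flight packets.
        link = min(link_quality, key=link_quality.get)
    flow_state[flow_id] = (link, now)
    return link

# Example: two packets of flow 42 separated by a long pause may take different links.
print(pick_link(42, time.monotonic()))
time.sleep(0.001)
print(pick_link(42, time.monotonic()))
```

Packet mode drops the inactivity check and sprays every packet for finer-grained balancing, while GLB would feed spine-reported link quality into the local quality table rather than deriving it from local state alone.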
Personnel: Vikram Singh
Securing AI Clusters, Juniper’s Approach to Threat Protection with Juniper Networks
Watch on YouTube
Watch on Vimeo
AI clusters are high-value targets for cyber threats, requiring a defense-in-depth strategy to safeguard data, workloads, and infrastructure. Kedar Dhuru highlighted how Juniper’s security portfolio provides end-to-end protection for AI clusters, including secure multitenant environments, without compromising performance. The presentation addressed the challenges of securing AI data centers, focusing on protecting WAN and data center interconnect links and preventing data loss, challenges amplified by the scale, performance, and multi-tenancy requirements of these environments.
Juniper’s approach to securing AI data centers involves several key use cases. These include protecting traffic between data centers, preventing DDoS attacks to maintain uptime, securing north-south traffic at the data center edge with application and network-level threat protection, and implementing segmentation within the data center to prevent lateral movement of threats. These security measures can be applied to traditional data centers and public cloud environments with the same functionalities and adapted for cloud-specific architectures. Juniper focuses on high-speed IPsec connections to ensure data encryption without creating performance bottlenecks.
Juniper uses threat detection capabilities to identify indicators of compromise, including inspecting downloaded software and models for tampering and detecting malicious communication from compromised models. The solution employs firewalls, machine learning algorithms, and threat feeds to detect and block malicious activity, along with domain generation algorithm (DGA) detection and DNS security to protect against threats. The presentation also highlighted Juniper’s new one-rack-unit firewall with high network throughput and MACsec capabilities, along with multi-tenancy and scale-out firewall architecture features.
Personnel: Kedar Dhuru
GPYOU: Building and Operating your AI Infrastructure with Juniper Networks
Watch on YouTube
Watch on Vimeo
AI infrastructure is a critical but complex domain, and IT organizations face pressure to deliver results quickly. Juniper Networks positions Juniper Apstra as a solution for streamlining the management of AI data centers, backed by proven designs. Kyle Baxter emphasizes the necessity of a robust network foundation for AI and ML workloads and highlights the shortcomings of traditional network management tools, which often overwhelm users with data, making it difficult to pinpoint root causes and resolve issues efficiently.
Juniper addresses these challenges with a comprehensive solution built on the Apstra platform, which features a contextual graph database, intent-based networking, and a vendor-agnostic design approach. Combined with Mist AI and the Marvis Virtual Network Assistant, Juniper aims to provide a holistic view of the data center, moving away from managing individual switches toward delivering desired outcomes. This approach tames network complexity, allowing precise identification of root causes, related symptoms, and impacted applications or training jobs.
The presentation focuses on managing the training side of AI and ML clusters. It highlights Apstra’s global capabilities for managing the various data center networks of large enterprises, including back-end, storage, and inference networks. Juniper offers both proven designs and the flexibility to manage any network design with a single tool. The key takeaways are the ability to design, deploy, and assure network operations using Juniper’s leading switching portfolio and security solutions, providing a streamlined, efficient, and reliable approach to AI infrastructure management.
Personnel: Kyle Baxter
Day 0: Designing your AI data center with Juniper Networks
Watch on YouTube
Watch on Vimeo
Juniper Networks’ presentation at AI Infrastructure Field Day focuses on designing AI data centers with Apstra, emphasizing rail-optimized designs. It highlights Apstra’s native modeling for these specialized designs and its ability to produce a fully functional network architecture in just minutes. Kyle Baxter, Head of Apstra Product Management at Juniper Networks, demonstrates how Apstra simplifies the complex process of deploying AI/ML fabrics by providing expert assistance.
The presentation emphasizes the speed and efficiency Apstra brings to AI data center design. Baxter showcases how users can input basic architectural requirements, such as the number of GPUs and servers, and Apstra then generates design options, including cabling maps, within minutes. This capability lets users visualize and validate their designs before ordering hardware, reducing potential errors and streamlining deployment. The system also enables users to build templates for different design types, such as collapsed rail or L3 Clos, making it easy to reuse designs as needed.
Furthermore, the presentation highlights Apstra’s flexibility and comprehensive approach to network management. Beyond designing the core AI/ML fabric, the platform can manage various network types within a single instance, including storage and front-end networks. Apstra facilitates the generation of configuration data for multiple vendors, ensuring that configurations are correct, and offering validation and continuous monitoring to maintain the integrity of the network. With features like REST-based APIs, Python SDKs, Terraform providers, and Ansible support, users can customize and integrate Apstra with existing workflows, making it a powerful tool for designing and managing AI infrastructure.
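As a rough illustration of the kind of programmatic integration Baxter described, the sketch below uses Python’s requests library to authenticate against an Apstra controller and list its blueprints. The endpoint paths, header name, and response fields shown are assumptions for illustration only; the Apstra API reference, Python SDKs, Terraform provider, and Ansible modules are the authoritative integration points.

```python
import requests

# Hypothetical sketch of driving Apstra through its REST API.
# The URL paths and payload/response fields below are illustrative assumptions;
# consult the official API documentation before relying on any of them.

APSTRA = "https://apstra.example.com"            # assumed controller address
AUTH = {"username": "admin", "password": "..."}  # placeholder credentials

session = requests.Session()
session.verify = False  # lab-only shortcut; use proper certificates in production

# 1. Authenticate and capture the API token (endpoint and header names assumed).
resp = session.post(f"{APSTRA}/api/aaa/login", json=AUTH)
resp.raise_for_status()
session.headers["AuthToken"] = resp.json()["token"]

# 2. List existing blueprints, e.g. a rail-optimized back-end fabric design.
blueprints = session.get(f"{APSTRA}/api/blueprints").json()
for bp in blueprints.get("items", []):
    print(bp.get("id"), bp.get("label"))
```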
Personnel: Kyle Baxter
Day 1: Managing your AI data center at scale with Juniper Networks
Watch on YouTube
Watch on Vimeo
This presentation by Kyle Baxter focuses on how Juniper Networks’ Apstra solution can manage AI data centers at scale. Apstra simplifies network configuration for AI/ML workloads by providing tools to assign virtual networks across numerous ports, an essential capability in environments with potentially millions of ports. The core of the presentation highlights the ability to provision virtual networks and configure load balancing with ease, using an intent-based approach that simplifies complex network tasks. This reduces the burden of manual configuration and allows users to quickly deploy and manage their AI data centers, regardless of the number of GPUs.
Baxter demonstrates how Apstra’s continuous validation features let users preemptively catch configuration issues, such as missing VLAN assignments, preventing errors before they impact operations. The system also supports bulk operations to streamline the assignment of virtual networks and subnets across the entire infrastructure, and it offers selectable load balancing policies explained through built-in help text. Juniper focuses on simplifying tasks through an intuitive interface to minimize the need for command-line configuration and enable faster, more efficient deployment of AI data centers.
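As a conceptual illustration of what continuous validation catches, the snippet below checks an intent-style port list for server-facing ports that were never assigned a virtual network. The data structure and field names are invented for this example and do not reflect Apstra’s actual data model.

```python
# Conceptual illustration of a continuous-validation style check:
# flag server-facing ports whose intended virtual network was never assigned.
# The data structure is invented for this example, not Apstra's actual model.

intended_ports = [
    {"switch": "leaf1", "port": "et-0/0/1", "role": "gpu-server", "vlan": 3001},
    {"switch": "leaf1", "port": "et-0/0/2", "role": "gpu-server", "vlan": None},
    {"switch": "leaf2", "port": "et-0/0/1", "role": "storage",    "vlan": 3002},
]

def missing_vlan_assignments(ports):
    """Return ports that should carry a virtual network but have none assigned."""
    return [p for p in ports if p["role"] != "unused" and p["vlan"] is None]

for p in missing_vlan_assignments(intended_ports):
    print(f"WARNING: {p['switch']} {p['port']} ({p['role']}) has no VLAN assigned")
```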
Finally, Baxter notes that Apstra is delivered as a virtual machine, available for download from the Juniper website. It can be deployed on-premises to manage a network and integrated with Mist AI for AI-driven network operations. The upcoming 6.0 release and support for advanced features such as RDMA-aware load balancing were also discussed. The key message is that Juniper’s Apstra enables efficient deployment and management of AI infrastructure at any scale, simplifying complex tasks with its user-friendly interface and automated features.
Personnel: Kyle Baxter
Day 2: Operating your AI data center with Juniper Networks
Watch on YouTube
Watch on Vimeo
Juniper Networks presented its latest Apstra functionality for AI data center network operations at AI Infrastructure Field Day. It focused on providing operators with the context and tools to manage complex AI networks efficiently. Jeremy Wallace, a Data Center/IP Fabric Architect, emphasized the importance of context in understanding the network’s expected behavior to identify and resolve issues quickly. Juniper is leveraging existing Apstra capabilities, augmented with new features such as compute agents deployable on NVIDIA servers, and enhanced probes and dashboards, to monitor AI networks. This presentation aims to equip operators to maintain optimal performance and minimize downtime in critical infrastructure environments.
The presentation highlighted the evolution of network management for AI data centers, moving from traditional methods to a more proactive, data-driven approach. The core of Juniper’s solution is telemetry, including data collected from GPU NICs and switches, that provides real-time insight into network performance. This enables operators to monitor key metrics, such as GPU network utilization and traffic patterns, and respond to potential issues swiftly. The Honeycomb view, traffic dashboards, and integration with congestion control mechanisms (ECN and PFC) provide visibility into the network’s behavior, giving operators the context and tools to diagnose and resolve problems faster.
Finally, Wallace gave a live demo of the platform, showcasing features such as real-time traffic analysis, heatmaps of GPU utilization, and auto-tuning load balancing. The auto-tuning functionality, delivered as “power packs,” dynamically adjusts parameters such as inactivity intervals to optimize performance and eliminate out-of-sequence packets, increasing the likelihood of successful job completion. These power packs are essentially Python scripts, and Juniper is actively developing more of them. Juniper is also working on deeper integrations with other vendors to support its customers’ environments and solutions.
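The auto-tuning behavior Wallace demonstrated can be pictured as a simple feedback loop: read out-of-sequence packet counters from telemetry and nudge the flowlet inactivity interval up or down until reordering disappears. The sketch below is a conceptual outline only; the function names, thresholds, and counter values are invented for illustration and do not represent Juniper’s power pack code.

```python
# Conceptual sketch of auto-tuning the DLB flowlet inactivity interval from telemetry.
# The counter readings and tuning constants are placeholders, not real Apstra/Junos data.

MIN_US, MAX_US, STEP_US = 16, 512, 16

def next_inactivity_interval(current_us: int, out_of_sequence_count: int) -> int:
    """One feedback step: lengthen the flowlet gap if reordering is seen, else shrink it."""
    if out_of_sequence_count > 0 and current_us < MAX_US:
        return current_us + STEP_US   # reordering observed: give flowlets a longer gap
    if out_of_sequence_count == 0 and current_us > MIN_US:
        return current_us - STEP_US   # stable: shrink the gap to rebalance sooner
    return current_us

# Example run against invented counter readings from successive polling intervals.
interval_us = 64
for oos in [12, 5, 0, 0, 0]:
    interval_us = next_inactivity_interval(interval_us, oos)
    print(f"out-of-sequence={oos:>3}  ->  inactivity interval {interval_us} us")
```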
Personnel: Jeremy Wallace