Jay Wilson, Mansour Karam, Nick Davey, Praful Lalchandani, Rajagopalan Subrahmanian, and Vikram Singh presented for Juniper Networks at Cloud Field Day 20
This presentation took place on June 12, 2024, from 8:00 to 11:30.
Presenters: Jay Wilson, Mansour Karam, Nick Davey, Praful Lalchandani, Rajagopalan Subrahmanian, Vikram Singh
Seize the AI Moment with Juniper Networks
AI transformation is here. Is your data center up to the task? Juniper’s AI-Native Networking Platform is optimized for the connectivity, data volume, and speed requirements of mission-critical AI workloads. Our AI data center solution offers the fastest and most flexible way to deploy reliable, high-performing AI training, inference, and storage clusters, and the simplest way to operate them with limited IT resources.
Mansour Karam, Global VP for Data Center at Juniper Networks, highlighted the rapid advancements and transformative potential of AI in his presentation at Cloud Field Day 20. He likened the current excitement surrounding AI to the early days of significant technological breakthroughs, such as the advent of x86 processors and the introduction of the iPhone. Karam emphasized that Juniper’s AI-Native Networking Platform is designed to meet the demanding connectivity, data volume, and speed requirements of AI workloads. The platform aims to simplify the deployment and operation of AI training, inference, and storage clusters, even with limited IT resources, ensuring high performance and reliability.
Karam detailed the evolution of networking technologies and their pivotal role in supporting various technological revolutions, from the internet and cloud to mobile and digital transformations. He underscored the importance of Ethernet in AI clusters, noting its widespread deployment, extensive ecosystem, and cost-effectiveness compared to specialized networks like InfiniBand. Juniper has been at the forefront of these advancements, offering both Broadcom-based and custom solutions to deliver high-performance Ethernet capabilities. The company has also focused on AI for networking, leveraging AI to transform network operations and enhance reliability, speed, and efficiency through automation and intent-based networking.
Furthermore, Karam introduced Apstra, a tool acquired by Juniper, which automates the entire lifecycle of data center networks using intent-based networking. Apstra ensures reliability and speed in network operations by pre-validating configurations and maintaining a closed-loop system for continuous telemetry collection and real-time interaction with the network. Juniper’s AI-driven approach also includes probabilistic insights and application awareness, providing comprehensive visibility and control over network performance. By combining these elements, Juniper aims to offer a robust, flexible, and vendor-agnostic solution for managing AI workloads and optimizing network performance, thus positioning itself as a leader in the AI networking wave.
Personnel: Mansour Karam
Networks Myths to Solutions – Juniper’s Approach to AI Data Centers
In the rapidly evolving landscape of AI-driven data centers, misconceptions abound, often clouding the decision-making process for IT professionals. In this session, we embark on a myth-busting journey to unveil the realities of AI data centers. This session will debunk common misconceptions and showcase how Juniper Networks’ cutting-edge networking solutions can transform and optimize your data center infrastructure.
Juniper Networks’ presentation at Cloud Field Day 20, led by Praful Lalchandani, focused on debunking myths surrounding AI-driven data centers and demonstrating how Juniper’s networking solutions can optimize these environments. Lalchandani emphasized the significant role of networking in maximizing the return on investment (ROI) for expensive GPU assets, which are central to AI projects. He outlined Juniper’s mission to deliver high network performance (400GbE and 800GbE), ease of operations through Apstra, and a lower total cost of ownership compared to proprietary technologies. The presentation included a detailed explanation of the AI application lifecycle, from data gathering and preprocessing to training and inference, highlighting Juniper’s solutions for both training and inference clusters.
The discussion delved into the specific network requirements for AI training and inference clusters. Lalchandani explained the critical role of the network in job completion time for training models, which involves complex traffic patterns due to data parallelism and model parallelism. He described the different types of networks involved: the backend GPU training network, the backend dedicated storage network, and the frontend network for external connectivity and orchestration. Lalchandani also addressed the importance of eliminating network bottlenecks to prevent GPUs from idling, which would otherwise lead to inefficiencies and increased costs. The presentation highlighted Juniper’s approach to achieving these goals through advanced load balancing techniques and congestion control mechanisms.
In addition to performance optimization, Lalchandani discussed the economic advantages of using Ethernet over InfiniBand for AI data centers. He presented evidence from Juniper’s AI innovation lab and customer use cases, showing that Ethernet can match InfiniBand’s performance while being more cost-effective and operationally simpler. The presentation also tackled the myth that packet spraying is necessary for maximizing performance, demonstrating that Juniper’s dynamic load balancing and flowlet-based techniques can achieve near-optimal results without requiring expensive SmartNICs. Finally, Lalchandani touched on the importance of lossless networking, showing that while 100% lossless networking is not always required, Juniper’s solutions can adapt to different model sensitivities, providing flexibility and efficiency in AI data center operations.
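The flowlet idea mentioned above can be illustrated with a minimal Python sketch. It is not Juniper's implementation: the class name, the gap threshold, and the per-link byte counters are all hypothetical. The sketch only shows the general technique, in which a flow stays on its current uplink while packets arrive close together and is moved to the least-loaded uplink only after an idle gap long enough that reordering is unlikely.

```python
class FlowletBalancer:
    """Toy flowlet-based load balancer (illustrative only, all names hypothetical)."""

    def __init__(self, num_links, gap_threshold):
        self.num_links = num_links
        self.gap_threshold = gap_threshold  # idle time (seconds) that starts a new flowlet
        self.link_bytes = [0] * num_links   # bytes sent on each uplink so far
        self.flow_state = {}                # flow_id -> (link, time of last packet)

    def route(self, flow_id, now, size):
        """Return the uplink index chosen for a packet of `size` bytes."""
        state = self.flow_state.get(flow_id)
        if state is not None and now - state[1] < self.gap_threshold:
            # Same flowlet: stay on the current link to avoid reordering.
            link = state[0]
        else:
            # New flowlet: pick the least-loaded uplink.
            link = min(range(self.num_links), key=lambda i: self.link_bytes[i])
        self.link_bytes[link] += size
        self.flow_state[flow_id] = (link, now)
        return link
```

Unlike per-packet spraying, this keeps packets of one burst in order on a single path, so the receiving NIC does not need to reassemble out-of-order traffic.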
Personnel: Praful Lalchandani
Your Private AI Data Center, as Easy as Cloud with Juniper Networks
Achieve public cloud-like service consumption in your on-prem AI data center with Apstra and Terraform. Apstra and the Terraform Provider for Apstra fit traditionally complex network services, like EVPN, neatly into a predefined application automation. This session shows how network teams can self-serve network services in a familiar way, providing seamless deployments across any infrastructure for new AI/ML workloads.
Nick Davey from Juniper Networks discussed how the increasing complexity and scale of modern data centers, particularly for AI and ML workloads, necessitate advanced automation solutions. He introduced Juniper Validated Designs (JVDs) and highlighted the AI JVD, which includes not just architectural diagrams and configurations but also the required automation to bring these designs to life using Apstra. The AI JVD allows for flexible, scalable, and automated deployment of AI data centers, making it possible to adapt to specific network requirements and efficiently manage the intricate configurations needed for AI workloads.
Central to this automation is Apstra, which provides a cloud-like API to manage physical data center resources. Apstra’s design cycle allows network engineers to move from traditional whiteboard planning to orchestrated, automated design and deployment. This process involves designing the network in Apstra, assigning physical resources, and deploying the network using Zero Touch Provisioning (ZTP). Apstra also supports continuous monitoring and optimization, ensuring the network remains in the desired state, which is crucial for handling the demanding and complex nature of AI workloads.
The integration with Terraform further enhances automation capabilities, allowing for bulk operations and dynamic infrastructure management. Terraform’s declarative approach complements Apstra’s intent-based networking, enabling users to manage their data centers as code. This integration facilitates seamless deployment and management of AI data centers, ensuring that network changes and configurations can be version-controlled, tested, and automated. Additionally, the use of ServiceNow as a front-end interface allows non-networking personnel, such as data scientists, to request and provision infrastructure without needing to understand the underlying complexities, thus democratizing access to AI resources and streamlining operations.
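The declarative model described above can be sketched in a few lines of Python. This is a generic illustration of desired-state reconciliation, not the Terraform Provider for Apstra or any Apstra API; the function name and the example virtual network names are hypothetical.

```python
def plan(desired, actual):
    """Compute the actions that move `actual` state to `desired` state.

    Both arguments map a resource name to its settings. The result is a list
    of (action, name) tuples, similar in spirit to a declarative tool's plan.
    """
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical example: two virtual networks declared, one drifted, one stale.
desired = {"vn-train": {"vlan": 100}, "vn-store": {"vlan": 200}}
actual = {"vn-train": {"vlan": 101}, "vn-old": {"vlan": 300}}
print(plan(desired, actual))
```

Because the operator declares only the end state, the same file can be version-controlled and reviewed, and the tool works out which changes are needed.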
Personnel: Nick Davey
Design, Deploy, and Operate AI Clusters like a Pro with Juniper Networks
Struggling with where to start with your on-prem AI training cluster? Juniper Validated Designs (JVDs) are rigorously pre-tested to make sure your deployments are relatively pain-free, and we now offer JVDs to meet the specific needs of AI data centers. See how Apstra does the Day 0/1/2 heavy lifting for you with intent-based automation.
Jay Wilson, an architect at Juniper Networks, presented at Cloud Field Day 20, focusing on the deployment and management of AI clusters using Juniper’s Apstra software. Wilson emphasized that Apstra is designed to manage data center fabrics rather than entire data centers, highlighting its ability to handle multiple fabrics and even multiple data centers from a single instance. The presentation aimed to demonstrate how Apstra’s intent-based automation can simplify the complex processes involved in setting up and maintaining AI training clusters. Wilson, with his extensive background in high-performance computing (HPC), underscored the importance of intent as the foundation of Apstra, ensuring that any changes made are validated against predefined goals, thus preventing misconfigurations.
Wilson provided a detailed walkthrough of how Apstra works, particularly focusing on its telemetry and configuration management capabilities. He explained that Apstra collects custom telemetry data, such as Explicit Congestion Notification (ECN) and Priority Flow Control (PFC) counters, to monitor the health and performance of AI clusters. This data is crucial for maintaining a lossless environment, which is vital for AI workloads. Wilson also discussed the use of configlets, small pieces of code that allow for fine-tuned adjustments to the network configuration. These configlets are essential for tailoring the environment to meet specific needs without disrupting the overall fabric management that Apstra provides.
The presentation also covered the operational aspects of using Apstra, including its anomaly detection and rollback features. Wilson demonstrated how Apstra’s single source of truth model ensures that all changes are validated and committed atomically, thus maintaining the integrity of the network. He showed how Apstra can identify and troubleshoot issues in real-time, using a combination of service and probe anomalies to pinpoint problems. Additionally, Wilson highlighted the importance of Apstra’s ability to roll back configurations to a previous state, a feature that is particularly useful in dynamic environments where multiple teams are making frequent changes. This capability ensures that any unintended disruptions can be quickly mitigated, thereby maintaining the stability and performance of the AI clusters.
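The validate-then-commit and rollback behavior described above can be sketched as follows. This is a toy model under stated assumptions, not Apstra's data model or API: the `IntentStore` class, the validator function, and the configuration keys are all hypothetical.

```python
import copy


class IntentStore:
    """Toy single source of truth with validated atomic commits and rollback."""

    def __init__(self, initial, validators):
        self.validators = validators              # callables: config -> error string or None
        self.history = [copy.deepcopy(initial)]   # committed revisions, oldest first

    @property
    def active(self):
        return self.history[-1]

    def commit(self, changes):
        """Apply all changes or none. Returns the list of validation errors."""
        candidate = copy.deepcopy(self.active)
        candidate.update(changes)
        errors = [e for v in self.validators if (e := v(candidate)) is not None]
        if not errors:
            self.history.append(candidate)        # accepted as one new revision
        return errors

    def rollback(self):
        """Discard the latest revision and return the previous one."""
        if len(self.history) > 1:
            self.history.pop()
        return self.active


def mtu_is_jumbo(config):
    # Hypothetical intent: fabric links must carry jumbo frames.
    return None if config.get("mtu", 0) >= 9000 else "mtu below 9000"
```

A change set that fails any validator leaves the active configuration untouched, and a change that passes but later proves unwanted can be undone by returning to the prior revision.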
Personnel: Jay Wilson
Automated Congestion Management in the AI Data Center with Juniper Networks
To maximize throughput and minimize packet loss, Ethernet uses the DCQCN congestion management protocol, but DCQCN introduces significant operational complexity for human operators. Learn how Juniper Apstra handles this new challenge in stride, automatically optimizing throughput and the “right amount” of packet loss.
Juniper Networks’ presentation at Cloud Field Day 20, led by Rajagopalan Subrahmanian and Vikram Singh, focused on automated congestion management in AI/ML data center fabrics. They began by explaining the challenges faced by network administrators in managing congestion, drawing an analogy to metering lights on freeways that regulate traffic flow. In AI/ML environments, the complexity increases due to the large number of entities that need monitoring and the manual, error-prone process of tuning congestion parameters. Juniper’s solution integrates with their Apstra platform to automate this process, leveraging continuous monitoring and closed-loop automation to optimize network performance dynamically.
The core of Juniper’s approach involves a DCQCN AutoTune application that utilizes Apstra’s capabilities to monitor key performance indicators and adjust network configurations in real-time. By simulating high-traffic scenarios in their lab, they demonstrated how the system detects congestion and uses Terraform to tweak configurations across the network fabric. This automated process helps maintain optimal throughput and the right amount of packet loss, adjusting parameters based on real-time data rather than static, manual settings. The system can apply changes selectively to affected switches or more broadly across similar network segments to preempt potential issues.
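One iteration of such a closed loop might look like the Python sketch below. The tuning rule is an assumption made for illustration, not the DCQCN AutoTune algorithm, and the parameter names, step size, and bounds are hypothetical. It lowers the ECN marking threshold when PFC pauses are observed and raises it when no congestion signal is seen.

```python
def tune_ecn_threshold(kmin_kb, pfc_pauses, ecn_marks,
                       step_kb=50, floor_kb=50, ceil_kb=2000):
    """Return the next ECN minimum threshold (KB) from one telemetry sample.

    Hypothetical rule, for illustration only:
    - PFC pauses seen: ECN is not slowing senders early enough, so mark earlier.
    - No ECN marks and no pauses: queues are shallow, so allow more buffering.
    - Otherwise: ECN alone is handling congestion, so keep the setting.
    """
    if pfc_pauses > 0:
        return max(floor_kb, kmin_kb - step_kb)
    if ecn_marks == 0:
        return min(ceil_kb, kmin_kb + step_kb)
    return kmin_kb


# Feed successive (pfc_pauses, ecn_marks) samples through the rule.
kmin = 500
for pauses, marks in [(3, 120), (1, 80), (0, 40), (0, 0)]:
    kmin = tune_ecn_threshold(kmin, pauses, marks)
print(kmin)
```

In a real deployment the new value would be pushed to the switches through the automation layer, and the next telemetry sample would show whether the change helped.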
Juniper’s method combines two Ethernet congestion control mechanisms: Priority Flow Control (PFC) and Explicit Congestion Notification (ECN). PFC acts as a brute-force method to stop traffic when buffers are nearly full, while ECN offers a more granular approach by marking packets to signal congestion and prompt sender devices to reduce their transmission rates. The DCQCN protocol judiciously uses both techniques to manage congestion effectively. Juniper’s automation adjusts these settings dynamically, ensuring that the network remains stable and efficient under varying loads. The presentation highlighted the flexibility and potential for further customization, including integration with application-level metrics and additional congestion indicators from SmartNICs.
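The division of labor between the two mechanisms can be sketched in Python. The marking curve follows the RED-style ramp commonly described for DCQCN switches (no marking below a minimum queue depth, a linear ramp up to a maximum probability, and marking every packet beyond the maximum depth). The function names, thresholds, and units are assumptions for illustration, not Juniper configuration values.

```python
def ecn_mark_probability(queue_kb, kmin_kb, kmax_kb, pmax):
    """Probability that a packet is ECN-marked at the given queue depth."""
    if queue_kb <= kmin_kb:
        return 0.0                      # queue is shallow: never mark
    if queue_kb > kmax_kb:
        return 1.0                      # queue is deep: mark every packet
    # Linear ramp from 0 at kmin to pmax at kmax.
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


def pfc_should_pause(queue_kb, xoff_kb):
    """PFC is the backstop: pause the upstream sender near buffer exhaustion."""
    return queue_kb >= xoff_kb


# Hypothetical thresholds: ECN ramps between 200 KB and 800 KB, PFC fires at 950 KB.
for depth in (100, 500, 900, 1000):
    print(depth, ecn_mark_probability(depth, 200, 800, 0.1), pfc_should_pause(depth, 950))
```

Setting the ECN thresholds well below the PFC threshold lets senders slow down gradually, so the hard pause is needed only when marking alone does not drain the queue in time.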
Personnel: Rajagopalan Subrahmanian, Vikram Singh