This video is part of the appearance, “Juniper Networks Presents at Cloud Field Day 20”. It was recorded as part of Cloud Field Day 20 at 8:00-11:30 on June 12, 2024.
Struggling with where to start with your on-prem AI training cluster? Juniper Validated Designs (JVDs) are rigorously pre-tested to make sure your deployments are relatively pain-free, and we now offer JVDs to meet the specific needs of AI data centers. See how Apstra does the Day 0/1/2 heavy lifting for you with intent-based automation.
Jay Wilson, an architect at Juniper Networks, presented at Cloud Field Day 20, focusing on the deployment and management of AI clusters using Juniper’s Apstra software. Wilson emphasized that Apstra is designed to manage data center fabrics rather than entire data centers, highlighting its ability to handle multiple fabrics and even multiple data centers from a single instance. The presentation aimed to demonstrate how Apstra’s intent-based automation can simplify the complex processes involved in setting up and maintaining AI training clusters. Wilson, with his extensive background in high-performance computing (HPC), underscored the importance of intent as the foundation of Apstra, ensuring that any changes made are validated against predefined goals, thus preventing misconfigurations.
Wilson provided a detailed walkthrough of how Apstra works, focusing on its telemetry and configuration management capabilities. He explained that Apstra collects custom telemetry data, such as explicit congestion notification (ECN) and priority flow control (PFC) counters, to monitor the health and performance of AI clusters. This data is crucial for maintaining a lossless environment, which is vital for AI workloads. Wilson also discussed configlets, small configuration snippets that allow fine-tuned adjustments to the network configuration. Configlets are essential for tailoring the environment to specific needs without disrupting the overall fabric management that Apstra provides.
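To make the telemetry idea concrete, here is a minimal sketch of flagging interfaces whose ECN or PFC counters jump between two polls. It is not Apstra's API or data model: `CounterSample`, `counter_anomalies`, the field names, and the threshold values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """One poll of cumulative counters for an interface (hypothetical schema)."""
    interface: str
    ecn_marked: int  # cumulative ECN-marked packets
    pfc_pause: int   # cumulative PFC pause frames


def counter_anomalies(prev, curr, ecn_limit=1000, pfc_limit=100):
    """Return (interface, counter, delta) for every counter whose growth
    between two polls exceeds its limit. The limits are illustrative only."""
    before = {sample.interface: sample for sample in prev}
    anomalies = []
    for sample in curr:
        old = before.get(sample.interface)
        if old is None:
            # No baseline for this interface yet, so no delta to evaluate.
            continue
        ecn_delta = sample.ecn_marked - old.ecn_marked
        pfc_delta = sample.pfc_pause - old.pfc_pause
        if ecn_delta > ecn_limit:
            anomalies.append((sample.interface, "ecn_marked", ecn_delta))
        if pfc_delta > pfc_limit:
            anomalies.append((sample.interface, "pfc_pause", pfc_delta))
    return anomalies
```

Comparing deltas rather than raw totals matters because such counters are cumulative: a large absolute value says little, while rapid growth between polls suggests congestion that could threaten a lossless fabric.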
The presentation also covered the operational aspects of using Apstra, including its anomaly detection and rollback features. Wilson demonstrated how Apstra’s single-source-of-truth model ensures that all changes are validated and committed atomically, maintaining the integrity of the network. He showed how Apstra can identify and troubleshoot issues in real time, using a combination of service and probe anomalies to pinpoint problems. Additionally, Wilson highlighted Apstra’s ability to roll back configurations to a previous state, a feature that is particularly useful in dynamic environments where multiple teams make frequent changes. This capability allows unintended disruptions to be mitigated quickly, preserving the stability and performance of the AI clusters.
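The validate-then-commit and rollback behavior described above can be sketched in a few lines. This is an illustration of the general pattern under stated assumptions, not Apstra's implementation: `IntentStore`, `require_pfc`, and the `mtu` and `pfc_enabled` keys are hypothetical names invented for the example.

```python
import copy


class IntentStore:
    """Toy single source of truth: every change is validated against the
    declared rules before it is committed, and old revisions stay available."""

    def __init__(self, initial, validators):
        self._history = [copy.deepcopy(initial)]
        self._validators = validators

    @property
    def current(self):
        # Return a copy so callers cannot mutate committed state directly.
        return copy.deepcopy(self._history[-1])

    def commit(self, changes):
        """Apply all changes or none; return the new revision number."""
        candidate = copy.deepcopy(self._history[-1])
        candidate.update(changes)
        errors = [msg for check in self._validators for msg in check(candidate)]
        if errors:
            # Nothing was appended, so the committed state is untouched.
            raise ValueError("; ".join(errors))
        self._history.append(candidate)
        return len(self._history) - 1

    def rollback(self, revision):
        """Re-commit an earlier revision as the newest state."""
        if not 0 <= revision < len(self._history):
            raise IndexError(f"unknown revision {revision}")
        self._history.append(copy.deepcopy(self._history[revision]))
        return len(self._history) - 1


def require_pfc(state):
    """Hypothetical intent rule: PFC must stay enabled."""
    return [] if state.get("pfc_enabled") else ["pfc_enabled must stay true"]
```

In this sketch a rollback is recorded as a new revision rather than by deleting history, so the record of who changed what survives even after the change is undone.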
Personnel: Jay Wilson