Linas Dauksa, Ankur Sheth, and Alex Bortok presented for Keysight at AI Field Day 5.
This presentation took place on September 11, 2024, from 8:00 to 9:30 am.
Presenters: Alex Bortok, Ankur Sheth, Linas Dauksa
Download a whitepaper at https://www.keysight.com/us/en/assets/7124-1061/white-papers/How-AI-ML-Networks-Differ-from-Traditional-Networks.pdf, a solution brief at https://www.keysight.com/us/en/assets/3123-1809/solution-briefs/Keysight-AI-Data-Center-Test-Platform.pdf, or visit https://www.keysight.com/us/en/products/network-test/protocol-load-test/ai-data-center-test-platform.html for more information.
Test Tomorrow’s AI Networks Today with Keysight
Watch on YouTube
Watch on Vimeo
AI deployment is growing rapidly, and the race to train and deliver new AI models quickly and efficiently is a top priority. The Keysight AI Data Center Test Platform is designed to accelerate innovation in AI network fabric validation and optimization, enabling you to test today’s AI networks with confidence. This presentation introduces Keysight, the challenges our customers face, and why realistic emulation and testing of AI workloads are critical.
In this presentation, Ankur Sheth of Keysight Technologies focuses on the rapid growth of AI deployment and the critical need for effective testing of AI network infrastructure. Keysight, with a rich history stemming from Hewlett-Packard, has established itself as a leader in test and measurement solutions across technology sectors. The company has evolved through acquisitions and innovation, positioning itself to address the unique challenges posed by the increasing complexity of AI networks. As AI technologies proliferate, particularly in hyperscale environments, robust testing solutions become paramount to ensure that the underlying infrastructure can deliver the high bandwidth, low latency, and reliability required for optimal performance.
Sheth highlights the significant role that network failures play in the inefficiency of AI training jobs, noting that 20% of job failures can be attributed to network issues. With GPUs being the most expensive resources in AI infrastructure, it is crucial to minimize the idle time they spend waiting on data transfers. The challenges of testing at scale are compounded by the high cost and limited availability of GPUs, making it impractical to build large physical test environments. As a result, realistic emulation and testing of AI workloads is essential, allowing operators to identify and resolve potential network issues before deploying their systems in production.
To address these challenges, Keysight introduces its AI Data Center Test Platform, which combines advanced hardware and software solutions tailored for testing AI network fabrics. This platform enables testing without the need for physical GPUs, thereby alleviating some of the cost and resource constraints faced by operators. The presentation sets the stage for a deeper exploration of the specific tools and methodologies that Keysight offers, such as the AresONE series of traffic generators, which are designed to facilitate effective testing and validation of AI networks. By providing these innovative solutions, Keysight aims to empower its customers to accelerate their AI initiatives and ensure the reliability of their network infrastructure.
Personnel: Ankur Sheth
Keysight AI Data Center Test Platform Architecture and Capabilities
Watch on YouTube
Watch on Vimeo
Keysight’s AI Data Center Test Platform is designed to emulate AI workloads, enabling users to benchmark and validate the performance of AI infrastructure in both pre-deployment labs and production AI clusters. The platform allows AI operators and equipment vendors to enhance the efficiency of AI model training over Ethernet networks by experimenting with various workload parameters and network designs. Notably, the platform provides comprehensive insights into the performance of collective communications and RDMA transport without the need for GPUs, making it a cost-effective solution for testing and optimization.
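Part of what makes GPU-free emulation plausible is that the traffic a collective generates is fully determined by the message size, the number of participants, and the algorithm, not by the computation itself. As a minimal sketch (standard ring all-reduce accounting, not Keysight’s implementation), the per-rank wire traffic can be computed directly:

```python
def ring_allreduce_bytes_per_rank(message_bytes: int, num_ranks: int) -> int:
    """Bytes each rank sends (and receives) in a ring all-reduce.

    A ring all-reduce runs a reduce-scatter followed by an all-gather;
    each phase moves (num_ranks - 1) chunks of size message_bytes / num_ranks
    through every rank, giving the well-known 2 * (n - 1) / n factor.
    """
    chunk = message_bytes / num_ranks
    return int(2 * (num_ranks - 1) * chunk)

# Example: a 1 GiB gradient bucket reduced across 8 emulated GPUs.
print(ring_allreduce_bytes_per_rank(1 << 30, 8) / (1 << 20), "MiB per rank")
```

Because this traffic pattern is deterministic, a traffic generator can reproduce it faithfully without any GPU in the loop.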
During the presentation, Alex Bortok and Ankur Sheth discussed the critical role of network performance in AI training, emphasizing that a significant portion of GPU time is spent on data communication rather than computation. They highlighted the importance of co-tuning the software stack and network components to achieve optimal performance, particularly as AI workloads grow in complexity and size. The speakers also explained the challenges associated with traditional benchmarking methods, which often fail to correlate performance metrics across different components of the AI infrastructure. The AI Data Center Test Platform addresses these challenges by providing a controlled environment for emulating workloads and generating real traffic, allowing for more accurate performance assessments.
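A back-of-the-envelope calculation shows why exposed communication time matters so much. Assuming, purely for illustration, a training step with 70 ms of compute and 30 ms of non-overlapped communication:

```python
def effective_utilization(compute_ms: float, exposed_comm_ms: float) -> float:
    """Fraction of step time the GPU spends computing rather than waiting."""
    return compute_ms / (compute_ms + exposed_comm_ms)

# Illustrative numbers, not from the presentation.
print(effective_utilization(70, 30))  # 0.70: 30% of GPU time is lost to the network
print(effective_utilization(70, 15))  # ~0.82 after halving exposed communication
```

Halving exposed communication time in this toy example recovers over a tenth of the fleet’s effective capacity, which is why co-tuning the network pays for itself quickly at GPU prices.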
The architecture of the platform is built on Keysight’s AresONE series of traffic generators, which can generate RoCE traffic at line rate. The platform’s software stack is API-driven, enabling users to run collective benchmarks and analyze the results effectively. The presenters outlined the platform’s testing capabilities, including load balancing, congestion control, and topology experimentation, all aimed at reducing the time required to train AI models. By providing deeper insights and repeatable test conditions, Keysight’s AI Data Center Test Platform positions itself as a valuable tool for optimizing AI infrastructure and accelerating the deployment of AI models.
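To make the API-driven workflow concrete, the sketch below sweeps one workload parameter and collects completion times. Every name in it (FakeClient, its run method, the parameters) is a hypothetical placeholder standing in for whatever bindings the platform actually exposes; it is not Keysight’s API.

```python
# Hypothetical sketch of driving an API-first test platform; every name here
# (FakeClient, run, sweep_qpairs) is an illustrative placeholder and
# NOT Keysight's actual API.
from statistics import mean
import random

class FakeClient:
    """Stand-in for the platform's real API bindings, for illustration only."""
    def run(self, collective: str, size: int, qpairs: int) -> float:
        # Pretend more queue pairs spread flows better and cut completion time.
        return size / 100e9 * (1.5 / min(qpairs, 8)) * random.uniform(1.0, 1.2)

def sweep_qpairs(client, qpair_counts, size_bytes=1 << 30, repeats=5):
    """Run an emulated all-reduce at each queue-pair count; report mean time."""
    return {
        qp: mean(client.run("all_reduce", size_bytes, qp) for _ in range(repeats))
        for qp in qpair_counts
    }

print(sweep_qpairs(FakeClient(), [1, 2, 4, 8]))
```

The point of the sketch is the shape of the workflow: because every run is scripted, the same sweep can be replayed after each network change, which is what makes the results repeatable.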
Personnel: Alex Bortok, Ankur Sheth
Taking the Keysight AI Data Center Test Platform for a Test Drive
Watch on YouTube
Watch on Vimeo
This demonstration of the AI Data Center Test Platform shows how network events impact job completion times. The first demo showcases the effects of congestion and poor fabric utilization on performance. You’ll also see how increasing the parallelism of data transfers improves both utilization and completion times.
In the presentation by Keysight Technologies at AI Field Day 5, Ankur Sheth, Director of AI Test R&D, demonstrated the AI Data Center Test Platform, focusing on how network events impact completion times. The setup emulated a server with eight GPUs connected to a two-tier fabric network, using the AresONE box to emulate the GPUs and their network interface cards (NICs). The demonstration aimed to show the effects of network congestion on performance and how increasing the parallelism of data transfers can improve fabric utilization and completion times. The first scenario examined the impact of congestion on the network, revealing poor performance due to misconfigured congestion control settings.
Sheth explained the configuration and results of running an all-reduce collective operation, which is commonly used during the backward pass of a training job. The initial test showed that the network’s poor configuration led to low utilization and high latency, with only 25% of the theoretical throughput achieved. Detailed flow completion times and cumulative distribution functions (CDFs) highlighted significant discrepancies in data transfer times, pointing to a problem in the network configuration. After adjusting the network settings, particularly the Priority Flow Control (PFC) settings, performance improved dramatically, reaching 95% utilization and significantly reducing completion times.
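The CDF view used in the demo is easy to reproduce from raw flow completion times in a notebook. Here is a minimal sketch with NumPy, using invented sample data rather than the demo’s measurements:

```python
import numpy as np

def empirical_cdf(samples):
    """Return sorted values and cumulative probabilities for a CDF plot."""
    xs = np.sort(np.asarray(samples, dtype=float))
    ps = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ps

# Invented example: flow completion times in ms before and after a PFC fix.
before = np.random.lognormal(mean=3.0, sigma=0.8, size=1000)
after = np.random.lognormal(mean=2.2, sigma=0.2, size=1000)
for label, fct in (("before", before), ("after", after)):
    xs, ps = empirical_cdf(fct)
    p99 = xs[np.searchsorted(ps, 0.99)]
    print(f"{label}: median={np.median(xs):.1f} ms, p99={p99:.1f} ms")
```

A long right tail in the CDF, like the "before" distribution here, is exactly the signature of a misconfiguration: a handful of slow flows dominate the collective’s completion time even when the median looks healthy.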
In a second experiment, Sheth demonstrated the impact of using different collective algorithms and increasing the number of queue pairs, the connections used by the RDMA over Converged Ethernet (RoCE) protocol. The halving-doubling algorithm initially showed average performance with significant tail latencies. Increasing the queue-pair count from one to eight made data transfers more parallel and consistent, allowing the network to load-balance traffic more effectively and improving fabric utilization. The presentation concluded with a demonstration of how the platform’s metrics and data can be integrated into automated test cases and analyzed with tools such as Jupyter notebooks, providing valuable insights for network designers and engineers.
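For context, halving-doubling pairs ranks at power-of-two distances, which is one reason its flows are sensitive to how the fabric load-balances them. A minimal sketch of the peer schedule (this is the textbook algorithm, not Keysight’s code):

```python
def halving_doubling_peers(rank: int, num_ranks: int) -> list[int]:
    """Peers a rank exchanges data with in a halving-doubling all-reduce.

    Phase 1 (reduce-scatter) walks distances n/2, n/4, ..., 1, halving the
    data exchanged at each step; phase 2 (all-gather) retraces the same
    peers in reverse while doubling. num_ranks must be a power of two.
    """
    assert num_ranks & (num_ranks - 1) == 0, "power-of-two ranks assumed"
    distances = []
    d = num_ranks // 2
    while d >= 1:
        distances.append(d)
        d //= 2
    return [rank ^ d for d in distances]

# Example: with 8 ranks, rank 0 exchanges with ranks 4, 2, then 1.
print(halving_doubling_peers(0, 8))  # [4, 2, 1]
```

With a single queue pair, each of these exchanges is one large flow that can hash onto a congested link; spreading the same bytes across eight queue pairs gives the fabric more flows to balance, which matches the improvement shown in the demo.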
Personnel: Ankur Sheth