Gavin McKee, Alex Bortok, and Amritam Putatunda presented for Keysight at AI Infrastructure Field Day 2
This presentation took place on April 25, 2025, from 10:30 to 12:00.
Presenters: Alex Bortok, Amritam Putatunda, Gavin McKee
Validating Frontend Networks to Optimize and Secure Low-Latency LLM Data Flow with Keysight
Watch on YouTube
Watch on Vimeo
As large language models scale, new challenges emerge – not only in maximizing GPU performance but also in validating the infrastructure that fuels the data pipeline used for training. On the front end, this includes securely ingesting user data from distributed cloud and customer environments into centralized AI data centers and ensuring high-speed, low-latency data transfer between VMs and hosts within the data centers. This session focuses on testing products that evaluate the performance, scalability, latency, reliability, and security of critical data pathways.
The presentation addresses the critical role of front-end networks in ensuring efficient data flow to data-hungry GPUs for LLM training. It highlights the two primary data movement patterns: north-south, involving data ingestion from sources like cloud providers and user environments, and east-west, which covers data transfer between virtual machines within the data center. Each pattern has distinct testing demands: east-west requires ultra-low latency and high line rates, while north-south requires a balance of minimal latency and strong security to protect user data in transit.
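As a rough illustration of why line rate dominates east-west testing, the back-of-envelope sketch below (with made-up shard sizes and link speeds, not figures from the session) computes how long it takes to move a data shard between VMs at different effective throughputs.

```python
# A back-of-envelope sketch (illustrative numbers only) of why east-west
# line rate matters: how long it takes to move a training-data shard
# between VMs at different effective link speeds.
SHARD_GB = 100                      # hypothetical shard size
for gbps in (10, 100, 200, 400):    # assumed effective east-west throughput
    seconds = SHARD_GB * 8 / gbps   # GB -> Gb, then divide by Gb/s
    print(f"{gbps:>4} Gb/s -> {seconds:7.1f} s per {SHARD_GB} GB shard")
```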
Keysight’s solution, CyPerf, is a software-based traffic generator designed to emulate various application traffic types and measure key performance indicators such as bandwidth, latency, connections per second, and security. CyPerf helps verify that GPUs receive data securely, with low latency, and at the rate the network infrastructure promises. The presenter emphasized the importance of thoroughly testing all layers of the OSI model, especially the application layer, to avoid relying on end users as beta testers. The presenter lauded Crusoe for proactively testing their front-end infrastructure to ensure optimal performance.
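To make those KPIs concrete, the minimal Python sketch below measures two of them, connection setup latency and connections per second, against a placeholder endpoint. It is a toy illustration only; it does not use or represent the CyPerf API.

```python
# Illustrative only: a toy probe for two of the KPIs mentioned above
# (connection setup latency and connections per second). This is NOT the
# CyPerf API; CyPerf emulates full application traffic at far higher scale.
import socket
import time

TARGET = ("198.51.100.10", 443)  # hypothetical test endpoint
DURATION_S = 5

def measure_cps_and_latency(target, duration_s):
    """Open TCP connections in a loop, recording setup latency and rate."""
    latencies = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        start = time.monotonic()
        try:
            s = socket.create_connection(target, timeout=1.0)
            latencies.append(time.monotonic() - start)
            s.close()
        except OSError:
            pass  # a real test would count and report failures separately
    cps = len(latencies) / duration_s
    avg_ms = 1000 * sum(latencies) / max(len(latencies), 1)
    return cps, avg_ms

if __name__ == "__main__":
    cps, avg_ms = measure_cps_and_latency(TARGET, DURATION_S)
    print(f"~{cps:.0f} connections/s, avg setup latency {avg_ms:.2f} ms")
```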
Personnel: Amritam Putatunda
Building Trust at Scale: How Crusoe Validates Network Infrastructure for AI Workloads with Keysight
Watch on YouTube
Watch on Vimeo
In this session, Crusoe shares how they are actively testing frontend networks and inter-VM/host data transfers that feed their GPU clusters. By validating the performance, reliability, and scalability of its infrastructure early, Crusoe aims to identify and resolve issues internally, minimizing the chance that end customers will discover them first. This is a differentiator for Crusoe, enabling a more robust, production-ready AI platform. Crusoe is a vertically aligned AI infrastructure company powered by sustainable energy sources, including wind, solar, and geothermal. They build AI data centers, with a large project underway in Abilene, Texas.
Crusoe’s AI cloud platform offers infrastructure as a service, where customers consume GPU supercomputing via virtualized machines. They also provide managed AI solutions, such as AI as a service for inference and other workloads. Their mission is to build the world’s favorite AI cloud, purpose-built for AI, with enterprise-scale infrastructure. The company focuses on the design and engineering of data center networks, software-defined networking, and GPU-to-GPU fabrics, all optimized using NVIDIA reference architectures. They emphasize customer support, offering 24/7 assistance to address the complexities and potential issues of GPU systems.
Crusoe partners with Keysight to conduct rigorous testing to ensure optimal performance and stability, particularly focusing on stateful traffic and high connection rates. They simulate various workloads to stress the system, identify breaking points, deliver deterministic performance, and prevent noisy-neighbor issues in their multi-tenant environment. This proactive approach allows Crusoe to understand the system’s limits and provide transparent performance data to customers, ensuring a world-class service and preventing users from becoming beta testers. They use CyPerf as a traffic generator to understand the behavior of open-source OVS and NVIDIA’s stack and to optimize testing. Plans include incorporating Blackwell platforms, advancing telemetry and monitoring, and focusing on storage optimization, scale, and security.
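The sketch below illustrates the breaking-point idea in miniature: ramp the offered connection load in steps and watch for the point where errors or latency degrade. It is a generic illustration, not Crusoe’s or Keysight’s tooling, and the endpoint and thresholds are placeholders.

```python
# A minimal load-ramp sketch (not Crusoe's or Keysight's tooling): step up
# offered connection load until errors or latency indicate a breaking point.
import concurrent.futures
import socket
import time

TARGET = ("198.51.100.10", 443)   # hypothetical endpoint under test

def one_connection(target):
    """Attempt one TCP connection; return (elapsed seconds, success flag)."""
    start = time.monotonic()
    try:
        with socket.create_connection(target, timeout=1.0):
            return time.monotonic() - start, True
    except OSError:
        return time.monotonic() - start, False

def ramp(target, steps=(50, 100, 200, 400)):
    """Increase concurrency step by step and report errors and latency."""
    for workers in steps:
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as ex:
            results = list(ex.map(lambda _: one_connection(target), range(workers)))
        ok = [lat for lat, success in results if success]
        errors = len(results) - len(ok)
        avg_ms = 1000 * sum(ok) / max(len(ok), 1)
        print(f"{workers:4d} concurrent: {errors} errors, avg {avg_ms:.1f} ms")
        if errors > workers * 0.05:   # crude breaking-point heuristic
            print("error rate exceeded 5% -- likely past the knee of the curve")
            break

if __name__ == "__main__":
    ramp(TARGET)
```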
Personnel: Gavin McKee
Maximizing the Performance of AI Backend Fabric with Keysight
Watch on YouTube
Watch on Vimeo
This session provides an overview of the Keysight AI (KAI) Data Center Builder solution and how it supports each phase of AI data center design and deployment with actionable data to improve performance and increase the reliability of AI clusters. The presentation explains how KAI Data Center Builder helps streamline the design process, optimizes resource allocation, and enhances the overall efficiency and stability of AI infrastructures to achieve superior performance and reliability. The discussion focuses on the performance of backend fabrics within AI clusters, highlighting Keysight’s expertise in creating solutions for companies that build various products, including microchips, smartphones, routers, 5G towers, and AI data centers.
Alex Bortok, Lead Product Manager for AI Data Center Solutions at Keysight Technologies, introduces the KAI Data Center Builder product, which is designed to test the backend networks that GPUs use for data exchange, especially during model training and inference. He distinguishes it from front-end network testing solutions like Keysight CyPerf. The presentation covers the capabilities of KAI Data Center Builder in emulating AI workloads, benchmarking network infrastructure performance, fine-tuning performance, reproducing production issues, and planning for new designs and topologies. The product emulates the behavior of real GPU servers, reducing the need to purchase them for lab testing.
The presentation also highlights two key KAI Data Center Builder applications: Collective Benchmarks and Workload Emulation. Collective Benchmarks lets users zoom in on a single transaction among thousands of GPUs so that network and system parameters can be fine-tuned. Workload Emulation accounts for the fact that GPUs not only move data but also compute, and that there is a dependency between how much data is computed and how much data is moved. It is noted that KAI Data Center Builder is part of the larger KAI solution portfolio. The presentation concludes by leading into a KAI Data Center Builder demonstration.
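A toy model of that compute/communication dependency is sketched below: it estimates per-iteration time when a gradient collective can partially overlap with compute. The overlap factor, gradient volume, and bus bandwidth are illustrative assumptions, not values from the presentation or from KAI Data Center Builder.

```python
# A toy model (a simplification for illustration, not KAI Data Center
# Builder's) of the compute/communication dependency described above:
# per-iteration time when the collective partially overlaps with compute.
def iteration_time(compute_s, grad_bytes, busbw_bytes_per_s, overlap=0.8):
    """Estimate step time: communication that cannot hide behind compute
    adds directly to the iteration."""
    comm_s = grad_bytes / busbw_bytes_per_s
    exposed = max(0.0, comm_s - overlap * compute_s)
    return compute_s + exposed

# Illustrative numbers only: 50 GB moved per step, 350 GB/s effective bus bandwidth.
for compute_s in (0.05, 0.1, 0.2):
    t = iteration_time(compute_s, grad_bytes=50e9, busbw_bytes_per_s=350e9)
    print(f"compute {compute_s*1000:5.0f} ms -> iteration {t*1000:6.1f} ms")
```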
Personnel: Alex Bortok
Demonstrating Keysight’s AI Fabric Test Methodology
Watch on YouTube
Watch on Vimeo
This session provides an overview of the Keysight AI fabric test methodology, demonstrating key findings and improvements achieved through automated testing and the search for optimal configuration parameters. Alex Bortok, Lead Product Manager at Keysight Technologies, introduces the Keysight AI fabric test methodology using the KAI Data Center Builder product. The methodology guides users through the phases of designing and building an AI fabric, emphasizing the importance of topology selection, collective operation algorithms, performance isolation, load balancing, and congestion control. The methodology and related white papers are available for download via a QR code or the link below.
The presentation delves into key terminology, including collective operations (broadcast, all-reduce, all-to-all), ranks, collective size, and data size. Metrics such as collective completion time, algorithm bandwidth, and bus bandwidth are defined and used to measure performance. Alex explains that bus bandwidth is a useful metric because it removes the number of GPUs from the equation and reflects the limiting factor that determines how long the collective operation will take. A testbed comprising four switches with 800-Gbps ports is described, emulating 16 GPUs/network cards running at 400 Gbps to assess fabric performance.
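For readers who want those metric definitions in code, the snippet below computes algorithm bandwidth and bus bandwidth following the convention popularized by the NCCL performance tests, where all-reduce bus bandwidth scales algorithm bandwidth by 2(n-1)/n. The example figures are illustrative, not results from the testbed described above.

```python
# Bandwidth metrics as defined in the session, following the convention used
# by the NCCL performance tests: algorithm bandwidth is data size divided by
# completion time; bus bandwidth rescales it so the result does not depend on
# the number of ranks.
def algorithm_bw(data_bytes, completion_s):
    return data_bytes / completion_s

def bus_bw_allreduce(data_bytes, completion_s, ranks):
    # In ring all-reduce, each rank sends and receives 2*(n-1)/n of the data,
    # so bus bandwidth = algbw * 2*(n-1)/n.
    return algorithm_bw(data_bytes, completion_s) * 2 * (ranks - 1) / ranks

# Illustrative example: a 1 GB all-reduce across 16 ranks completing in 25 ms.
algbw = algorithm_bw(1e9, 0.025)
busbw = bus_bw_allreduce(1e9, 0.025, 16)
print(f"algbw {algbw/1e9:.1f} GB/s, busbw {busbw/1e9:.1f} GB/s")
```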
A demonstration highlights the impact of congestion control on network performance. By comparing scenarios with and without congestion control enabled, the presentation illustrates how fine-tuning DCQCN parameters can optimize bandwidth utilization and reduce congestion. The speaker uses the tool to showcase testing of different settings on the fabric to achieve the optimal configuration. The presentation concludes by mentioning Ultra Ethernet Consortium membership and upcoming webinars detailing Keysight’s innovations in AI.
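The search for an optimal configuration can be pictured as a simple parameter sweep, as in the sketch below. The candidate ECN threshold values and the `run_collective_benchmark` helper are hypothetical placeholders for whatever drives the fabric test, not a Keysight API.

```python
# A sketch of the parameter-search idea shown in the demo: sweep candidate
# ECN marking settings used by DCQCN and keep the one with the lowest
# collective completion time. run_collective_benchmark is a hypothetical
# stand-in for the actual test harness; it is not a real Keysight API.
from itertools import product

def run_collective_benchmark(kmin_kb, kmax_kb, pmax):
    """Hypothetical: apply ECN thresholds on the fabric, run an all-reduce
    benchmark, and return collective completion time in milliseconds."""
    raise NotImplementedError("wire this to your own test harness")

def sweep():
    best = None
    for kmin, kmax, pmax in product((100, 400), (400, 1600), (0.05, 0.2)):
        if kmin >= kmax:
            continue  # marking must start below the maximum threshold
        cct_ms = run_collective_benchmark(kmin, kmax, pmax)
        if best is None or cct_ms < best[0]:
            best = (cct_ms, kmin, kmax, pmax)
    return best
```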
Personnel: Alex Bortok