This video is part of the appearance “Keysight Presents at AI Infrastructure Field Day 2”. It was recorded as part of AI Infrastructure Field Day 2 from 10:30 to 12:00 on April 25, 2025.
In this session, Crusoe shares how they actively test the frontend networks and inter-VM/host data transfers that feed their GPU clusters. By validating the performance, reliability, and scalability of their infrastructure early, Crusoe aims to find and resolve issues internally before end customers discover them. They see this as a differentiator that makes for a more robust, production-ready AI platform. Crusoe is a vertically integrated AI infrastructure company powered by sustainable energy sources, including wind, solar, and geothermal. They build AI data centers, with a large project underway in Abilene, Texas.
Crusoe’s AI cloud platform offers infrastructure as a service, where customers consume GPU supercomputing via virtual machines. They also provide managed AI solutions such as AI as a service, inference, and workloads. Their mission is to build the world’s favorite AI cloud: purpose-built for AI, with enterprise-scale infrastructure. The company focuses on the design and engineering of data center networks, software-defined networking, and GPU-to-GPU fabrics, all optimized using NVIDIA reference architectures. They also emphasize customer support, offering 24/7 assistance with the complexity of GPU systems and any issues that arise.
Crusoe partners with Keysight to conduct rigorous testing for performance and stability, focusing in particular on stateful traffic and high connection rates. They simulate various workloads to stress the system and identify its breaking points, with the goal of providing deterministic performance and preventing noisy-neighbor issues in their multi-tenant environment. This proactive approach allows Crusoe to understand the system’s limits and share transparent performance data with customers, so that users do not end up acting as beta testers. They use CyPerf as a traffic generator to understand the behavior of open-source Open vSwitch (OVS) and NVIDIA’s stack, and to refine their testing. Future plans include incorporating Blackwell platforms, advancing telemetry and monitoring, and focusing on storage optimization, scale, and security.
Personnel: Gavin McKee