In this session, Crusoe shares how it actively tests the frontend networks and inter-VM/host data transfers that feed its GPU clusters. By validating the performance, reliability, and scalability of its infrastructure early, Crusoe aims to identify and resolve issues internally before end customers encounter them. The company treats this as a differentiator that enables a more robust, production-ready AI platform. Crusoe is a vertically integrated AI infrastructure company powered by sustainable energy sources, including wind, solar, and geothermal. It builds AI data centers, with a large project underway in Abilene, Texas.
Crusoe’s AI cloud platform offers infrastructure as a service, where customers consume GPU supercomputing through virtual machines. It also provides managed AI solutions such as AI as a service, inference, and workloads. Its mission is to build the world’s favorite AI cloud, purpose-built for AI, with enterprise-scale infrastructure. The company focuses on the design and engineering of data center networks, software-defined networking, and GPU-to-GPU fabrics, all optimized using NVIDIA reference architectures. It also emphasizes customer support, offering 24/7 assistance with the complexity of GPU systems and any issues that arise.
Crusoe partners with Keysight to test rigorously for performance and stability, focusing in particular on stateful traffic and high connection rates. The team simulates various workloads to stress the system and identify breaking points, with the goals of providing deterministic performance and preventing noisy-neighbor issues in its multi-tenant environment. This proactive approach lets Crusoe understand the system’s limits and give customers transparent performance data, so that users do not end up acting as beta testers. Crusoe uses Keysight CyPerf as a traffic generator to understand the behavior of open-source Open vSwitch (OVS) and NVIDIA’s stack, and to refine its testing. Future plans include incorporating Blackwell platforms, advancing telemetry and monitoring, and focusing on storage optimization, scale, and security.
Personnel: Gavin McKee