This video is part of the appearance, “Google Cloud Presents at AI Infrastructure Field Day 2 – Afternoon”. It was recorded as part of AI Infrastructure Field Day 2 from 13:00 to 16:30 on April 22, 2025.
Watch on YouTube
Watch on Vimeo
Dan Eawaz, Senior Product Manager at Google Cloud, introduced Google Cloud Managed Lustre, a fully managed parallel file system built on DDN EXAScaler. The aim is to meet the demanding requirements of data preparation, model training, and inference in AI workloads. Managed Lustre provides high throughput to keep GPUs and TPUs fully utilized and enables fast checkpoint writes and reads.
Currently, many customers run parallel file systems (PFSs) like Lustre on-prem. Google Cloud Managed Lustre makes it easier for those customers to bring their workloads to the cloud without re-architecting, and it optimizes TCO by maximizing the utilization of expensive GPUs and TPUs. The offering is a persistent service deployed co-located with compute for optimal latency, scaling from 18 terabytes to petabyte scale, with sub-millisecond latency and an initial throughput of one terabyte per second.
As a managed service, customers specify their region, capacity, and throughput needs; Google then deploys the capacity in the background and provides a mount point for easy integration with GCE or GKE. The Google Cloud Managed Lustre service has a 99.9% availability SLA in a single zone and is fully POSIX compliant. The service integrates with GKE via a CSI driver and supports Slurm through the Cluster Toolkit. It also integrates with Google Cloud Storage for batch data transfer in both directions.
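To make the mount step concrete, the snippet below is a minimal sketch using the standard Lustre client mount syntax on a GCE VM; it assumes the Lustre client packages are already installed, and the instance address and filesystem name shown are placeholders, with the actual mount point reported by the provisioned Managed Lustre instance.

    # Placeholder address and filesystem name; substitute the mount point
    # reported for your Managed Lustre instance.
    sudo mkdir -p /mnt/lustre
    sudo mount -t lustre 10.0.0.2@tcp:/lustrefs /mnt/lustre
    # Verify the mount and the capacity it exposes.
    df -h /mnt/lustre

From the workload's point of view, the mounted path then behaves like any other POSIX file system, which is what allows on-prem Lustre workloads to move over without re-architecting.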
Personnel: Dan Eawaz