Dan Eawaz, Senior Product Manager at Google Cloud, introduced Google Cloud Managed Lustre, a fully managed parallel file system built on DDN EXAScaler. It targets the demanding storage requirements of data preparation, model training, and inference in AI workloads. Managed Lustre provides enough throughput to keep GPUs and TPUs fully utilized and allows checkpoints to be written and read back quickly.
Many customers already run parallel file systems (PFSs) such as Lustre on premises. Google Cloud Managed Lustre lets them bring those workloads to the cloud without re-architecting, and it improves total cost of ownership (TCO) by maximizing the utilization of expensive GPUs and TPUs. The offering is a persistent service deployed co-located with compute for optimal latency. It scales from 18 terabytes to petabyte scale, with sub-millisecond latency and an initial throughput of one terabyte per second.
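The link between storage throughput and accelerator utilization comes down to simple arithmetic: while a synchronous checkpoint is being written, the accelerators wait. The following is a minimal sketch of that calculation, not a benchmark. The 2 TB checkpoint size and the 10 GB/s comparison figure are hypothetical values chosen for illustration; only the one terabyte per second figure comes from the presentation.

```python
TB = 10**12  # bytes in a terabyte (decimal)
GB = 10**9   # bytes in a gigabyte (decimal)


def checkpoint_write_seconds(checkpoint_bytes: float, throughput_bytes_per_s: float) -> float:
    """Return the ideal time to write a checkpoint at a sustained throughput.

    Ignores metadata overhead, contention, and client-side limits, so the
    result is a lower bound rather than a prediction.
    """
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return checkpoint_bytes / throughput_bytes_per_s


# Hypothetical checkpoint size, for illustration only.
checkpoint_size = 2 * TB

# Throughput figure quoted in the presentation.
at_1_tb_per_s = checkpoint_write_seconds(checkpoint_size, 1 * TB)

# Hypothetical slower file system, for comparison only.
at_10_gb_per_s = checkpoint_write_seconds(checkpoint_size, 10 * GB)

print(f"1 TB/s:  {at_1_tb_per_s:.0f} s per checkpoint")
print(f"10 GB/s: {at_10_gb_per_s:.0f} s per checkpoint")
```

Under these assumptions the same checkpoint takes seconds rather than minutes, which is time the GPUs or TPUs would otherwise spend idle on every checkpoint interval.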
Because the service is fully managed, customers only specify their region, capacity, and throughput needs. Google then deploys the capacity in the background and provides a mount point for easy integration with GCE or GKE. Google Cloud Managed Lustre has a 99.9% availability SLA in a single zone and is fully POSIX compliant. The service integrates with GKE via a CSI driver and supports Slurm through the Cluster Toolkit. It also includes a built-in integration for batch data transfer to and from Google Cloud Storage.
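Because the file system is exposed as a POSIX-compliant mount point, application code can use ordinary file operations rather than a storage-specific SDK. The sketch below shows one common pattern for writing checkpoints with standard Python calls: write to a temporary file, flush it to storage, then atomically rename it into place. The function names and the directory used here are hypothetical; a temporary directory stands in for wherever the file system would actually be mounted.

```python
import os
import tempfile


def save_checkpoint(data: bytes, directory: str, name: str) -> str:
    """Write a checkpoint so readers never observe a partially written file."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, name)

    # Create the temporary file in the same directory so the rename below
    # stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to storage
        os.replace(tmp_path, final_path)  # atomic rename on POSIX file systems
    except BaseException:
        # Clean up the partial temporary file before re-raising.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


def load_checkpoint(path: str) -> bytes:
    """Read a checkpoint back with a plain file read."""
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    # Stand-in for the real mount point, which is an assumption here.
    with tempfile.TemporaryDirectory() as mount_dir:
        path = save_checkpoint(b"model-state", os.path.join(mount_dir, "ckpt"), "step-100.bin")
        print(load_checkpoint(path))
```

The same code runs unchanged against a local disk, which is the practical meaning of moving a workload without re-architecting it.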
Personnel: Dan Eawaz