|
This video is part of the appearance, “Google Cloud Presents at AI Data Infrastructure Field Day 1”. It was recorded as part of AI Data Infrastructure Field Day 1 at 10:30-12:00 on October 2, 2024.
Watch on YouTube
Watch on Vimeo
In his presentation on Google Cloud Storage for AI/ML workloads, Dave Stiver, Group Product Manager at Google Cloud, discussed the critical role of cloud storage in the AI data pipeline, focusing on training, checkpointing, and inference. He emphasized that time to serve is what matters most to machine learning developers: scalability and performance are essential, but so is the ability to interact with object storage through a file interface, since developers are accustomed to file systems. Stiver introduced two features that improve cloud storage performance for AI workloads: GCS FUSE, which lets users mount Cloud Storage buckets as local file systems, and Anywhere Cache, a zonal cache that accelerates data access by caching data close to the accelerators.
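As a rough sketch of what the GCS FUSE workflow looks like in practice (bucket name and mount point here are hypothetical, and flags may vary by gcsfuse version):

```shell
# Mount a Cloud Storage bucket as a local directory (bucket name is hypothetical)
mkdir -p /mnt/training-data
gcsfuse --implicit-dirs my-training-bucket /mnt/training-data

# Training code can now read objects through ordinary file I/O
ls /mnt/training-data

# Unmount when the job finishes
fusermount -u /mnt/training-data
```

Because the bucket appears as a directory tree, existing file-based data loaders can read training data without being rewritten for the object storage API.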
Stiver shared a use case involving Woven, the autonomous driving division of Toyota, which transitioned from using Lustre to GCS FUSE for their training jobs. This shift resulted in a 50% reduction in training costs and a 14% decrease in training time, demonstrating the effectiveness of the local cache feature in GCS FUSE. He also explained the functionality of Anywhere Cache, which allows users to cache data in the same zone as their accelerators, providing high bandwidth and efficient data access. The presentation highlighted the importance of understanding the consistency model of the cache and how it interacts with the underlying storage, ensuring that users can effectively manage their data across different regions and zones.
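A minimal sketch of enabling Anywhere Cache from the command line, assuming the `gcloud storage buckets anywhere-caches` command group (bucket name and zone below are hypothetical, and exact arguments may differ by gcloud version):

```shell
# Create an Anywhere Cache for a bucket in the same zone as the accelerators
gcloud storage buckets anywhere-caches create gs://my-training-bucket us-central1-a

# Check the state of the bucket's caches
gcloud storage buckets anywhere-caches list gs://my-training-bucket
```

Placing the cache in the accelerators' zone is what delivers the high-bandwidth, low-latency reads described above, while the bucket itself remains the durable source of truth.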
The discussion then shifted to the introduction of Parallelstore, a fully managed parallel file system designed for high-throughput AI workloads. Stiver explained that Parallelstore is built on DAOS technology and targets users who need extremely high performance for their AI training jobs. He stressed the value of integrating such file systems with Cloud Storage to balance cost and performance, particularly for organizations that manage large datasets across hybrid environments. The presentation concluded with a focus on the evolving landscape of AI workloads and the need for storage solutions tailored to the diverse requirements of different applications and user personas within organizations.
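For context, provisioning a Parallelstore instance is itself a single managed operation; a hedged sketch using the `gcloud parallelstore` command group (instance name, location, capacity, and network are all hypothetical, and flag names may differ by gcloud version):

```shell
# Provision a managed Parallelstore instance near the training accelerators
gcloud parallelstore instances create my-training-fs \
  --location=us-central1-a \
  --capacity-gib=12000 \
  --network=projects/my-project/global/networks/default
```

The instance can then be mounted by the training cluster, with Cloud Storage serving as the lower-cost durable tier that data is staged from and drained back to.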
Personnel: Dave Stiver