Dave Stiver, Sean Derrington, and Manjul Sahay presented for Google Cloud at AI Data Infrastructure Field Day 1.
This presentation was recorded on October 2, 2024, from 10:30 to 12:00.
Presenters: Dave Stiver, Dean Hildebrand, Manjul Sahay, Raj Hosamani, Sean Derrington
Workload and AI-Optimized Infrastructure from Google Cloud
Watch on YouTube
Watch on Vimeo
Sean Derrington from Google Cloud’s storage group presented on the company’s efforts to optimize AI and workload infrastructure, focusing on the needs of large-scale customers. Google Cloud has been working on a comprehensive system, referred to as the AI hypercomputer, which integrates hardware and software to help customers efficiently manage their AI tasks. The hardware layer includes a broad portfolio of accelerators like GPUs and TPUs, tailored for different workloads. The network capabilities of Google Cloud ensure predictable and consistent performance globally. Additionally, Google Cloud offers various framework packages and managed services like Vertex AI, which supports different AI activities, from building and training models to serving them.
Derrington highlighted the recent release of Parallelstore, Google Cloud’s first managed parallel file system, and Hyperdisk ML, a read-only block storage service. These new storage solutions are designed to handle the specific demands of AI workloads, such as training, checkpointing, and serving. Parallelstore, for instance, is built on local SSDs and is suitable for scratch storage, while Hyperdisk ML allows multiple hosts to attach and read the same data, making it ideal for serving workloads where many nodes load the same model weights. The presentation also touched on the importance of selecting the right storage solution based on the size and nature of the training data set, checkpointing needs, and serving requirements. Google Cloud’s open ecosystem, including partnerships with companies like Sycomp, offers additional storage options such as GPFS-based solutions.
The presentation emphasized the need for customers to carefully consider their storage requirements, especially as they scale their AI operations. Different storage solutions suit different scales of operation, from small jobs requiring low latency to large-scale, high-throughput needs. Google Cloud aims to provide consistent and flexible storage solutions that can seamlessly transition from on-premises to cloud environments. The goal is to simplify the decision-making process for customers and ensure they have access to the necessary resources, such as NVIDIA H100 GPUs, which might not be available on-premises. The session concluded with a promise to delve deeper into the specifics of Parallelstore and other storage solutions, highlighting their unique capabilities and use cases.
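The matching of workloads to storage described above can be sketched as a toy decision helper. The thresholds and product mapping below are illustrative assumptions for this summary, not Google's published sizing guidance:

```python
# Toy heuristic for matching an AI workload to a Google Cloud storage option.
# The thresholds and the product mapping are illustrative assumptions only.

def recommend_storage(dataset_tib: float, latency_sensitive: bool,
                      shared_read_only: bool) -> str:
    """Return a hypothetical storage recommendation for a training/serving job."""
    if shared_read_only:
        # Many hosts reading the same model weights or shards.
        return "Hyperdisk ML"
    if latency_sensitive and dataset_tib < 100:
        # Smaller, latency-bound jobs benefit from a managed parallel file system.
        return "Parallelstore"
    # Large, throughput-oriented jobs can stream from object storage,
    # optionally through a file interface such as Cloud Storage FUSE.
    return "Cloud Storage (with Cloud Storage FUSE)"

print(recommend_storage(10, latency_sensitive=True, shared_read_only=False))
# → Parallelstore
```

The point of the sketch is only that the decision hinges on a handful of workload attributes, which is the framing the presentation used.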
Personnel: Sean Derrington
Google Cloud Storage for AI ML Workloads
Watch on YouTube
Watch on Vimeo
In his presentation on Google Cloud Storage for AI ML workloads, Dave Stiver, Group Product Manager at Google Cloud, discussed the critical role of cloud storage in the AI data pipeline, particularly focusing on training, checkpoints, and inference. He emphasized the importance of time to serve for machine learning developers, highlighting that while scalability and performance are essential, the ability to interact with object storage through a file interface is crucial for developers who are accustomed to file systems. Stiver introduced two key features, GCS FUSE and Anywhere Cache, which enhance the performance of cloud storage for AI workloads. GCS FUSE allows users to mount cloud storage buckets as local file systems, while Anywhere Cache provides a local zonal cache that significantly boosts data access speeds by caching data close to the accelerators.
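Because Cloud Storage FUSE exposes a bucket as a directory, training code can use ordinary file I/O against object storage (after a mount along the lines of `gcsfuse BUCKET /mnt/gcs`). In this minimal sketch the mount point and data layout are assumptions, and a temporary directory stands in for the mounted bucket:

```python
# With a bucket mounted via Cloud Storage FUSE, training code reads objects
# with ordinary file I/O. The mount point and shard naming are assumptions;
# a temporary directory stands in for the mount so the sketch is runnable.
import os
import tempfile

def list_training_shards(mount_point: str, suffix: str = ".tfrecord") -> list:
    """List shard files under an assumed FUSE mount point using plain file APIs."""
    return sorted(
        os.path.join(mount_point, name)
        for name in os.listdir(mount_point)
        if name.endswith(suffix)
    )

# Simulate a mounted bucket containing two shards and one unrelated file.
with tempfile.TemporaryDirectory() as mount:
    for name in ("train-00000.tfrecord", "train-00001.tfrecord", "notes.txt"):
        open(os.path.join(mount, name), "w").close()
    print(len(list_training_shards(mount)))  # → 2
```

Nothing in the loop body knows it is talking to object storage, which is exactly the appeal of the file interface for developers accustomed to file systems.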
Stiver shared a use case involving Woven, the autonomous driving division of Toyota, which transitioned from using Lustre to GCS FUSE for their training jobs. This shift resulted in a 50% reduction in training costs and a 14% decrease in training time, demonstrating the effectiveness of the local cache feature in GCS FUSE. He also explained the functionality of Anywhere Cache, which allows users to cache data in the same zone as their accelerators, providing high bandwidth and efficient data access. The presentation highlighted the importance of understanding the consistency model of the cache and how it interacts with the underlying storage, ensuring that users can effectively manage their data across different regions and zones.
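The read-through behavior described for Anywhere Cache, where a hit is served from the zonal cache and a miss is fetched from the bucket and then cached close to the accelerators, can be illustrated with a toy wrapper. Both classes below are hypothetical stand-ins, not the actual service API:

```python
# Toy read-through cache illustrating the Anywhere Cache idea: reads are
# served locally on a hit, and fetched from the bucket and cached on a miss.
# Both classes are hypothetical stand-ins, not the real Google Cloud API.

class Bucket:
    """Stand-in for a regional object store."""
    def __init__(self, objects):
        self.objects = objects
        self.reads = 0  # reads that had to leave the zone

    def get(self, key):
        self.reads += 1
        return self.objects[key]

class ReadThroughCache:
    """Stand-in for a zonal cache co-located with the accelerators."""
    def __init__(self, bucket):
        self.bucket = bucket
        self.cache = {}

    def get(self, key):
        if key not in self.cache:           # miss: fetch and populate
            self.cache[key] = self.bucket.get(key)
        return self.cache[key]              # hit: served from the zone

bucket = Bucket({"shard-0": b"weights"})
cache = ReadThroughCache(bucket)
cache.get("shard-0")
cache.get("shard-0")
print(bucket.reads)  # → 1: the second read never left the zone
```

The consistency questions raised in the presentation are exactly about when entries in a cache like this must be invalidated relative to writes on the underlying storage.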
The discussion then shifted to the introduction of Parallelstore, a fully managed parallel file system designed for high-throughput AI workloads. Stiver explained that Parallelstore is built on DAOS technology and targets users who require extremely high performance for their AI training jobs. He emphasized the importance of integrating storage solutions with cloud storage to optimize costs and performance, particularly for organizations that need to manage large datasets across hybrid environments. The presentation concluded with a focus on the evolving landscape of AI workloads and the need for tailored storage solutions that can adapt to the diverse requirements of different applications and user personas within organizations.
Personnel: Dave Stiver
Google Cloud Vertex AI & Google Cloud NetApp Volumes
Watch on YouTube
Watch on Vimeo
Rajendraprasad Hosamani from Google Cloud Storage presented on the integration of Google Cloud Vertex AI with Google Cloud NetApp Volumes, emphasizing the importance of grounding AI agents in bespoke, enterprise-specific data. He highlighted that AI workloads are diverse and that agents can significantly enhance user experiences by providing interactive, personalized, and efficient data sharing. For agents to be effective, they must be grounded in the specific truths of an organization, which requires seamless data integration from various sources, whether on-premises or in the cloud. This integration also necessitates robust governance to ensure data is shared appropriately within the enterprise.
Vertex AI, Google’s flagship platform for AI app builders, offers a comprehensive suite of tools categorized into model garden, model builder, and agent builder. The model garden allows users to select from first-party, third-party, or open-source models, while the model builder focuses on creating custom models tailored to specific business needs. The agent builder facilitates the responsible and reliable creation of AI agents, incorporating capabilities like orchestration, grounding, and data extraction. This platform supports no-code, low-code, and full-code development experiences, making it accessible to a wide range of users within an organization.
The integration of NetApp Volumes with Vertex AI enables the use of NetApp’s proven ONTAP storage stack as a data store within Vertex AI. This allows for the seamless incorporation of enterprise data into AI development workflows, facilitating the creation, testing, and fine-tuning of AI agents. Raj demonstrated how this integration can elevate user experiences through various implementations, such as chat agents, search agents, and recommendation agents, all of which can be developed with minimal coding. This integration empowers organizations to leverage their accumulated data to create rich, natural language-based interactions for their end users, thereby enhancing the overall value derived from their AI investments.
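Grounding an agent in enterprise data, as described above, follows a retrieval-augmented pattern: retrieve the relevant documents from the data store, then assemble the prompt around them. This toy sketch uses a naive keyword-overlap retriever; all names and the retrieval method are illustrative assumptions, not the Vertex AI Agent Builder API:

```python
# Minimal retrieval-grounding sketch: pull matching documents from an
# enterprise store, then assemble a grounded prompt. The keyword-overlap
# retriever and all names here are illustrative, not the Vertex AI API.

def retrieve(question: str, documents: list, k: int = 2) -> list:
    """Rank documents by naive keyword overlap with the question."""
    terms = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(terms & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def grounded_prompt(question: str, documents: list) -> str:
    """Build a prompt that constrains the agent to the retrieved context."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "Refund policy: refunds are issued within 30 days of purchase.",
    "Shipping policy: orders ship within 2 business days.",
]
print(grounded_prompt("What is the refund policy?", docs).splitlines()[1])
```

In the actual integration, the retrieval step is what the NetApp Volumes data store supplies, so the agent's answers stay grounded in the organization's own data.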
Personnel: Raj Hosamani
Managing Google Cloud Storage at Scale with Gemini
Watch on YouTube
Watch on Vimeo
In his presentation, Manjul Sahay from Google Cloud discussed the challenges and solutions for managing vast amounts of data in Google Cloud Storage, particularly for enterprises involved in data-intensive activities like autonomous driving and drug discovery. He highlighted that traditional methods of data management become ineffective when dealing with billions of objects and petabytes of data. The complexity is compounded by the need for security, cost management, and operational insights, which are difficult to achieve at such a large scale. To address these challenges, Google Cloud has developed new capabilities to streamline the process, making it easier for customers to manage their data efficiently.
One of the key solutions introduced was the Insights dataset, which aggregates metadata from billions of objects and thousands of buckets into BigQuery for analysis. This daily snapshot of metadata includes custom tags and other relevant information, allowing users to gain insights without extensive manual querying and scripting. The capability is designed to be user-friendly, enabling even non-experts to perform complex data analysis with just a few clicks. By leveraging BigQuery’s tools, users can generate actionable insights quickly, which is crucial for maintaining security and compliance, as well as optimizing storage usage and costs.
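Once the daily metadata snapshot lands in BigQuery, an insight like "which buckets hold the most data?" becomes plain SQL. The table and column names below are assumptions for illustration; in practice the generated SQL would be submitted via the google-cloud-bigquery client or the console:

```python
# Sketch of an analysis over a storage-insights snapshot in BigQuery.
# The dataset/table name and the column names are assumptions for this
# illustration; the real snapshot schema may differ.

def largest_buckets_sql(table: str, snapshot_date: str, limit: int = 10) -> str:
    """Build SQL totalling object count and bytes per bucket for one snapshot."""
    return f"""
SELECT bucket, COUNT(*) AS objects, SUM(size) AS total_bytes
FROM `{table}`
WHERE snapshot_date = '{snapshot_date}'
GROUP BY bucket
ORDER BY total_bytes DESC
LIMIT {limit}
""".strip()

sql = largest_buckets_sql("my_project.insights.object_metadata", "2024-10-01")
print(sql.splitlines()[0])  # → SELECT bucket, COUNT(*) AS objects, SUM(size) AS total_bytes
```

This is the kind of query that previously required scripting against bucket listings; with the metadata already in BigQuery it runs as a single aggregation.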
Additionally, Google Cloud has integrated AI capabilities through Gemini, a natural language interface that allows users to query metadata in real-time without needing specialized knowledge. This feature democratizes data management by shifting some responsibilities from storage admins to end-users, making the process more efficient and less error-prone. Gemini also provides verified answers to common questions, ensuring accuracy and reliability. The overall goal of these innovations is to help enterprises manage their data at scale, keeping it secure, compliant, and ready for AI applications, thereby enabling them to focus on their core business objectives.
Personnel: Manjul Sahay