Workload and AI-Optimized Infrastructure from Google Cloud

Sean Derrington presented for Google Cloud at AIDIFD1

This video is part of the appearance, “Google Cloud Presents at AI Data Infrastructure Field Day 1”. It was recorded as part of AI Data Infrastructure Field Day 1 at 10:30-12:00 on October 2, 2024.

Sean Derrington from Google Cloud’s storage group presented on the company’s efforts to optimize infrastructure for AI and other workloads, focusing on the needs of large-scale customers. Google Cloud has been building a comprehensive system, the AI Hypercomputer, which integrates hardware and software to help customers run their AI tasks efficiently. The hardware layer includes a broad portfolio of accelerators, such as GPUs and TPUs, suited to different workloads. Google Cloud’s network is designed to deliver predictable, consistent performance globally. On top of this, Google Cloud offers framework packages and managed services such as Vertex AI, which supports AI activities from building and training models to serving them.

Derrington highlighted the recent release of Parallelstore, Google Cloud’s first managed parallel file system, and Hyperdisk ML, a read-only block storage service. These storage offerings are designed for the specific demands of AI workloads: training, checkpointing, and serving. Parallelstore, for instance, is built on local SSDs and is suitable for scratch storage, while Hyperdisk ML allows multiple hosts to access the same data, which suits AI applications that read a shared data set. The presentation also touched on the importance of selecting the right storage solution based on the size and nature of the training data set, checkpointing needs, and serving requirements. Google Cloud’s open ecosystem, including partnerships with companies like Sycomp, offers additional storage options such as GPFS-based solutions.

The presentation emphasized that customers need to consider their storage requirements carefully, especially as they scale their AI operations. Different storage solutions suit different scales of operation, from small-scale jobs that require low latency to large-scale jobs that need high throughput. Google Cloud aims to provide consistent, flexible storage solutions that let workloads move from on-premises to cloud environments without disruption. The goal is to simplify decision-making for customers and give them access to resources that might not be available on-premises, such as NVIDIA H100 GPUs. The session concluded with a promise to delve deeper into the specifics of Parallelstore and other storage solutions, highlighting their capabilities and use cases.

Personnel: Sean Derrington
