|
Vivek Saraswat and Marco Abela presented for Google Cloud at AI Infrastructure Field Day 2
This presentation took place on April 22, 2025, from 13:00 to 16:30.
Presenters: Aniruddha Agharkar, Dan Eawaz, Dennis Lu, Ivan Poddubnyy, Michal Szymaniak, Rose Zhu, Vaibhav Katkade, Vivek Saraswat
Intro to Managed Lustre with Google Cloud
Watch on YouTube
Watch on Vimeo
Dan Eawaz, Senior Product Manager at Google Cloud, introduced Managed Lustre with Google Cloud, a fully managed parallel file system built on DDN EXAScaler. The aim is to solve the demanding requirements of data preparation, model training, and inference in AI workloads. Managed Lustre provides high throughput to keep GPUs and TPUs fully utilized and enables fast writing and reading of checkpoints.
Currently, many customers leverage parallel file systems (PFSs) like Lustre on-prem. Google Cloud Managed Lustre makes it easier for customers to bring their workloads to the cloud without re-architecting. It optimizes TCO by maximizing the utilization of expensive GPUs and TPUs. The offering is a persistent service deployed co-located with compute for optimal latency, scaling from 18 terabytes to petabyte scale, with sub-millisecond latency and an initial throughput of one terabyte per second.
The service is managed: customers specify their region, capacity, and throughput needs, and Google deploys the capacity in the background, providing a mount point for easy integration with GCE or GKE. The Google Cloud Managed Lustre service has a 99.9% availability SLA in a single zone and is fully POSIX compliant. The service integrates with GKE via a CSI driver and supports Slurm through the Cluster Toolkit. It also has a built-in integration for batch data transfer to and from Google Cloud Storage.
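Because the service is fully POSIX compliant and exposed as a mount point, attaching it from a GCE VM looks like any other Lustre mount. The following is a minimal sketch, not taken from the presentation, assuming the Lustre client packages are already installed; the MGS address, filesystem name, and mount directory are placeholders for the values Google provides when an instance is created.

```python
"""Minimal sketch: mount a Managed Lustre instance on a GCE VM.

Assumes the Lustre client packages are installed and the script runs as root.
The address, filesystem name, and mount directory below are placeholders.
"""
import subprocess
from pathlib import Path

MGS_ADDRESS = "10.0.0.2@tcp"      # placeholder: address shown for your instance
FS_NAME = "lustrefs"              # placeholder: filesystem name of your instance
MOUNT_DIR = Path("/mnt/lustre")   # local directory to mount the filesystem under


def mount_lustre() -> None:
    """Create the mount directory and mount the Lustre filesystem (POSIX semantics)."""
    MOUNT_DIR.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["mount", "-t", "lustre", f"{MGS_ADDRESS}:/{FS_NAME}", str(MOUNT_DIR)],
        check=True,
    )


if __name__ == "__main__":
    mount_lustre()
    # Once mounted, checkpoints can be written with ordinary file I/O:
    (MOUNT_DIR / "checkpoint-0001.bin").write_bytes(b"\x00" * 1024)
```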
Personnel: Dan Eawaz
The latest in high-performance storage, Rapid on Colossus with Google Cloud
Watch on YouTube
Watch on Vimeo
Michal Szymaniak, Principal Engineer at Google Cloud, presented on Rapid Storage, a new zonal storage product within the cloud storage portfolio, powered by Google’s foundational distributed file system, Colossus. The goal in designing Rapid Storage was to create a storage system that offers the low latency of block storage, the high throughput of parallel file systems, and the ease of use and scale of object storage, while remaining easy to start with and manage. Rapid Storage addresses limitations customers face when working with GCS, where they often tier to Persistent Disk for lower latency or seek out parallel file system functionality.
Rapid Storage is fronted by Cloud Storage FUSE, aiming to provide a parallel file system experience at 20 million requests per second and six terabytes per second of data throughput, while maintaining cloud storage benefits like 11 nines of durability within a zone and three nines of availability. Sub-millisecond latency for random reads and appends, achieved after opening a file through the FUSE interface, makes it significantly faster than offerings from other hyperscalers. This predictable performance is enabled by a separate data stack focused on high performance, where GCS emphasizes manageability and scalability.
Rapid Storage, available as a new storage class in zonal locations, exposes Colossus in a particular cloud zone, bringing it to the forefront of cloud storage. By accessing storage directly from the front end, only one RPC hop away from the data on physical hard drives, Rapid Storage minimizes latency. A stateful gRPC-based streaming protocol and a hierarchical namespace further enhance file-system-friendly operations and performance. Rapid Storage is currently in private preview, with allow-list GA targeted for later this year, and users are encouraged to sign up and provide feedback.
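Since Rapid Storage is fronted by Cloud Storage FUSE, applications can treat a bucket like a local filesystem. The sketch below is a hypothetical illustration of that access pattern, assuming the bucket has already been mounted with Cloud Storage FUSE; the mount path and file names are placeholders, not details from the presentation.

```python
"""Minimal sketch: file-style access to a Rapid Storage bucket through Cloud Storage FUSE.

Assumes the bucket has already been mounted (e.g. with gcsfuse) at the path below;
the mount path and file names are placeholders.
"""
from pathlib import Path

MOUNT = Path("/mnt/rapid")  # hypothetical Cloud Storage FUSE mount point

# Append-style write: open once, then stream records into the object.
with open(MOUNT / "training-log.jsonl", "a") as log:
    log.write('{"step": 1, "loss": 3.2}\n')
    log.write('{"step": 2, "loss": 2.9}\n')

# Low-latency random read: seek into a large file instead of downloading it.
with open(MOUNT / "shards" / "shard-00000.bin", "rb") as shard:
    shard.seek(4096)          # jump to an arbitrary offset
    block = shard.read(1024)  # read a small block from the middle of the object
    print(len(block), "bytes read")
```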
Personnel: Michal Szymaniak
Analytics Storage and AI, Data Prep and Data Lakes with Google Cloud
Watch on YouTube
Watch on Vimeo
Vivek Saraswat, Group Product Manager at Google Cloud Storage, presented on analytics storage and AI, focusing on data preparation and data lakes. He emphasized the close ties between analytics and AI workloads, highlighting key innovations built to address related challenges. The presentation demonstrated that analytics plays a crucial role in the AI data pipeline, particularly in ingestion, data preparation, and cleaning.
Saraswat explained how customers increasingly build unified data lakehouses using open metadata table formats like Apache Iceberg. This approach enables analytics and AI workloads, including running analytics on AI data. He cited Snap as a customer example, processing trillions of user events weekly using Spark for data preparation and cleaning on top of Google Cloud Storage. Google Cloud Storage offers optimizations like the Cloud Storage Connector, Anywhere Cache, and Hierarchical Namespace (HNS) to enhance data preparation.
Saraswat covered the concept of a data lakehouse, combining structured and unstructured data in a unified platform with a separation layer built on open table formats. Examples from Snowflake, Databricks, Uber, and Google Cloud’s BigQuery tables for Apache Iceberg illustrated the diverse architectures employed. Saraswat also addressed common customer challenges like data fragmentation, performance bottlenecks, and optimization for resilience, security, and cost, offering solutions like Storage Intelligence, Anywhere Cache, and Bucket Relocate, and referencing customer case studies such as Spotify and Two Sigma.
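To make the data-preparation pattern concrete, the following is a minimal PySpark sketch of the kind of pipeline described above: raw events are read from Cloud Storage through the Cloud Storage Connector, cleaned, and written to an Apache Iceberg table whose warehouse also lives in a bucket. The bucket names, catalog name, and table name are placeholders, and the snippet assumes an environment (such as Dataproc) where the connector and the Iceberg Spark runtime are already available.

```python
"""Minimal sketch: Spark data prep on Cloud Storage feeding an Iceberg table.

Assumes the Cloud Storage connector and Iceberg Spark runtime are on the classpath;
bucket names, the catalog name, and the table name are placeholders.
"""
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("data-prep")
    # Register a Hadoop-style Iceberg catalog whose warehouse lives in a GCS bucket.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "gs://example-bucket/warehouse")
    .getOrCreate()
)

# Clean raw JSON events read straight from Cloud Storage via the connector.
events = (
    spark.read.json("gs://example-bucket/raw/events/*.json")
    .dropDuplicates(["event_id"])
    .filter(F.col("user_id").isNotNull())
)

# Write the prepared data as an Iceberg table for downstream analytics and AI jobs.
events.writeTo("lake.analytics.clean_events").createOrReplace()
```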
Personnel: Vivek Saraswat
AI Hypercomputer and GPU acceleration with Google Cloud
Watch on YouTube
Watch on Vimeo
Dennis Lu, a Product Manager at Google Cloud specializing in GPUs, presented on AI Hypercomputer and GPU acceleration with Google Cloud. Lu covered Google Cloud’s AI Hypercomputer, from consumption models to purpose-built hardware, with particular focus on Google’s Cluster Director for managing GPU fleets.
Dennis then moved to the hardware aspect of Google Cloud’s AI infrastructure, discussing current and upcoming GPU systems. Available systems include A3 Ultra (H200 GPUs), A4 (B200 GPUs), and A4X (GB200 systems), built on RoCE networking over ConnectX-7 (CX-7) NICs. Also discussed were two systems coming in 2025, the NVIDIA RTX Pro 6000 and a GB300 system, offering advancements in memory and networking.
The presentation also featured performance projections for LLM training, with A4 offering approximately 2x the performance of H100s. The A4 was described as a Goldilocks solution due to its balance of price and performance. There was also discussion on whether Hopper-generation GPUs would decrease in price because of newer generations of hardware.
Personnel: Dennis Lu
AI Hypercomputer and TPU (Tensor) acceleration with Google Cloud
Watch on YouTube
Watch on Vimeo
Rose Zhu, a Product Manager at Google Cloud TPU, presented on TPUs for large-scale training and inference, emphasizing the rapid growth of AI models and the corresponding demands for compute, memory, and networking. Zhu highlighted the specialization of Google’s TPU chips and systems, purpose-built ASICs for machine learning applications, coupled with innovations in power efficiency, networking using Jupiter optical networks and ICI, and liquid cooling. A key focus was on co-designing TPUs with software, enabling them to function as a supercomputer, supported by frameworks like JAX and PyTorch, and a low-level compiler (XLA) to maximize performance.
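As a small illustration of the JAX-plus-XLA co-design model mentioned above (not code from the presentation), the sketch below jit-compiles a toy layer so that XLA fuses it into a single program that can target TPU, GPU, or CPU; the shapes and bfloat16 dtype are arbitrary choices for the example.

```python
"""Minimal sketch: JAX traces Python code once and XLA compiles it into a fused program."""
import jax
import jax.numpy as jnp


@jax.jit  # XLA compiles this whole function into a single fused program
def dense_layer(x, w, b):
    return jax.nn.relu(x @ w + b)


key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 1024), dtype=jnp.bfloat16)
w = jax.random.normal(key, (1024, 4096), dtype=jnp.bfloat16)
b = jnp.zeros((4096,), dtype=jnp.bfloat16)

y = dense_layer(x, w, b)   # first call triggers XLA compilation; later calls reuse it
print(y.shape, y.dtype)    # (128, 4096) bfloat16
```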
Zhu showcased real-world TPU usage, powering Google’s internal applications like Gmail and YouTube and serving external cloud customers across various segments, including Anthropic, Salesforce, Mercedes, and Kakao. The adoption of Cloud TPUs has seen significant growth, with an eightfold increase in chip-per-hour consumption within 12 months. A major announcement was the upcoming 7th-generation TPU, Ironwood, slated for general availability in Q4 2025, featuring two configurations, TPU7 and TPU7X, to address diverse data center requirements and customer needs for locality and low latency.
Zhu detailed the specifications of Ironwood, including its BF16 and FP8 support, teraflops performance, and high-bandwidth memory. Ironwood boasts significant performance and power-efficiency improvements over previous TPU generations. Zhu also touched on optimizing TPU performance through techniques like flash attention, host DRAM offload, mixed-precision training, and an inference stack for TPU. GKE handles TPU orchestration for large-scale training and inference, with an emphasis on improving both scheduling goodput and runtime goodput.
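One of the optimization techniques Zhu mentioned, mixed-precision training, typically keeps a float32 master copy of the weights while running the heavy compute in bfloat16. The following JAX sketch illustrates that general pattern under those assumptions; the toy loss, shapes, and learning rate are illustrative and not taken from the presentation.

```python
"""Minimal sketch: mixed-precision training step with float32 master weights."""
import jax
import jax.numpy as jnp


def loss_fn(params_f32, x_bf16):
    # Cast float32 master weights to bfloat16 for the compute-heavy matmul.
    w = params_f32["w"].astype(jnp.bfloat16)
    preds = x_bf16 @ w
    # Accumulate the loss in float32 for numerical stability.
    return jnp.mean(jnp.square(preds.astype(jnp.float32)))


@jax.jit
def train_step(params_f32, x_bf16, lr=1e-3):
    grads = jax.grad(loss_fn)(params_f32, x_bf16)    # gradients come back in float32
    return {"w": params_f32["w"] - lr * grads["w"]}  # update the float32 master copy


params = {"w": jnp.ones((512, 512), dtype=jnp.float32)}
batch = jnp.ones((64, 512), dtype=jnp.bfloat16)
params = train_step(params, batch)
```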
Personnel: Rose Zhu
Cloud WAN: Connecting networks for the AI Era with Google Cloud
Watch on YouTube
Watch on Vimeo
This presentation by Aniruddha Agharkar, Product Manager at Google Cloud Networking, centers on Cloud WAN, Google’s fully managed backbone solution designed for the enterprise era and powered by Google’s planet-scale network. Customers have historically relied on bespoke networks using leased lines and MPLS providers, leading to inconsistencies in security, operational challenges, a lack of visibility, and slow adoption of cloud technologies. Cloud WAN addresses these issues by providing a consistent network underpinned by Google’s high-performance, reliable infrastructure that supports billions of users daily, including services like YouTube and Google Maps.
Cloud WAN offers any-to-any connectivity, serving as a centerpiece for connecting branch offices, data centers, and other cloud providers, with the potential to reduce total cost of ownership by up to 40% and improve application performance by a similar margin. Key use cases include enabling high-performance connectivity between various network locations and consolidating branch offices into the Google Cloud ecosystem. To facilitate this, Google is introducing cross-site interconnect, providing Layer 2 connectivity across different sites, backed by the same infrastructure as cloud interconnects, decoupling cloud infrastructure from site-to-site connectivity.
The presentation also highlights the importance of SD-WAN integration, allowing customers to bring their SD-WAN head-ends into the cloud and leverage premium tier networking for faster and more reliable connectivity. Google’s Verified Peering Program (VPP) identifies and badges validated ISPs to ensure optimal last-mile connectivity into Google Cloud. Success stories from customers like Nestlé demonstrate the benefits of Cloud WAN, including enhanced stability, reliability, cost efficiency, and improved latency. Cloud WAN and associated services like cross-site interconnects, SD-WAN appliances, and NCC Gateway are accessible through the same Google Cloud console, streamlining management and offering a unified networking solution.
Personnel: Aniruddha Agharkar
Secure and optimize AI and ML workloads with the Cross-Cloud Network with Google Cloud
Watch on YouTube
Watch on Vimeo
Vaibhav Katkade, a Product Manager at Google Cloud Networking, presented on infrastructure enhancements in cloud networking for secure, optimized AI/ML workloads. He focused on the lifecycle of AI/ML, encompassing training, fine-tuning, and serving/inference, and the corresponding network imperatives for each stage. Data ingestion relies on fast, secure connectivity to on-premises environments via Interconnect and Cross-Cloud Interconnect, facilitating high-speed data transfer. GKE clusters now support up to 65,000 nodes, a significant increase in scale, enabling the training of large models like Gemini. Improvements to cloud load balancing enhance performance, particularly for LLM workloads.
A key component discussed was the GKE Inference Gateway, which optimizes LLM serving. It leverages inference metrics from model servers like vLLM, Triton, Dynamo, and Google’s JetStream to perform load balancing based on KV cache utilization, improving performance. The gateway also supports autoscaling based on model server metrics, dynamically adjusting compute allocation based on request load and GPU utilization. Additionally, it enables multiplexing multiple model use cases onto a single base model using LoRA fine-tuned adapters, increasing model-serving density and making efficient use of accelerators. The gateway supports multi-region capacity chasing and integration with security tools like Google’s Model Armor, Palo Alto Networks, and NVIDIA’s NeMo Guardrails.
Katkade also covered key considerations for running inference at scale on Kubernetes. One significant challenge addressed is the constrained availability of GPU/TPU capacity across regions. Google’s solution allows routing to regions with available capacity through a single inference gateway, streamlining operations and improving capacity utilization. Platform and infrastructure teams gain centralized control and consistent baseline coverage across all models by integrating AI security tools directly at the gateway level. Further discussion included load balancing optimization based on KV cache utilization, achieving up to 60% lower latency and 40% higher throughput. The gateway supports model name-based routing and prioritization, compliant with the OpenAI API spec, and allows for different autoscaling thresholds for production and development workloads.
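Because the gateway routes on the model name and follows the OpenAI API spec, a client can address a specific LoRA-adapter variant simply by naming it in the request body. The snippet below is a hypothetical illustration of that calling pattern; the gateway URL and model names are placeholders rather than real endpoints.

```python
"""Minimal sketch: calling an LLM behind an inference gateway via the
OpenAI-compatible chat completions shape. The endpoint URL and model names
are placeholders."""
import json
import urllib.request

GATEWAY_URL = "http://inference-gateway.example.internal/v1/chat/completions"  # placeholder

payload = {
    # The gateway routes on the model name; a LoRA-adapter variant of the base
    # model can be selected the same way (names here are illustrative).
    "model": "gemma-base-support-adapter",
    "messages": [{"role": "user", "content": "Summarize my last three support tickets."}],
    "max_tokens": 256,
}

req = urllib.request.Request(
    GATEWAY_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
    print(reply["choices"][0]["message"]["content"])
```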
Personnel: Vaibhav Katkade