Day 1: Managing your AI data center at scale with Juniper Networks

Event: AI Infrastructure Field Day 2

Appearance: Juniper Networks Presents at AI Infrastructure Field Day 2

Company: Juniper Networks

Video Links:

Personnel: Kyle Baxter

This presentation by Kyle Baxter focuses on how Juniper Networks’ Apstra solution can manage AI data centers at scale. Apstra simplifies network configuration for AI/ML workloads by providing tools to assign virtual networks across numerous ports, an essential capability in environments with potentially millions of ports. The core of the presentation highlights the ability to provision virtual networks and configure load balancing with ease, using an intent-based approach that simplifies complex network tasks. This reduces the burden of manual configuration and allows users to quickly deploy and manage their AI data centers, regardless of the number of GPUs.

Baxter demonstrates how Apstra's continuous validation features allow users to catch configuration issues pre-emptively, such as missing VLAN assignments, preventing errors before they impact operations. Furthermore, the system enables bulk operations that streamline the assignment of virtual networks and subnets across the entire infrastructure. The solution includes options for selecting load balancing policies, with clear explanations provided through built-in help text. Juniper focuses on simplifying tasks through an intuitive interface to minimize the need for command-line configuration and to make deployment of AI data centers faster and more efficient.
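
The bulk operations Baxter performs in the UI can also be driven through Apstra's REST API. The sketch below is a minimal, hypothetical illustration of that workflow in Python; the endpoint paths, header name, payload fields, and blueprint ID are assumptions for illustration rather than the documented API.

```python
import requests

APSTRA = "https://apstra.example.net"   # hypothetical controller address
BLUEPRINT = "ai-backend-fabric"         # hypothetical blueprint ID

# Authenticate and capture an API token (path and header name are assumptions).
session = requests.Session()
session.verify = False                  # lab-only convenience
resp = session.post(f"{APSTRA}/api/aaa/login",
                    json={"username": "admin", "password": "secret"})
session.headers["AuthToken"] = resp.json()["token"]

# Bulk-create one virtual network per GPU rail (payload shape is illustrative).
for rail in range(8):
    vn = {"label": f"gpu-rail-{rail}", "vn_type": "vlan", "vlan_id": 100 + rail}
    r = session.post(f"{APSTRA}/api/blueprints/{BLUEPRINT}/virtual-networks",
                     json=vn)
    r.raise_for_status()
    print("created", vn["label"])
```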

Finally, Baxter covers the delivery of Apstra as a virtual machine, available for download on the Juniper website. It can be deployed on premises to manage a network and integrated with Mist AI for AI-driven network operations. The upcoming release of version 6.0 and support for advanced features like RDMA-aware load balancing were also discussed. The key message is that Juniper's Apstra enables efficient deployment and management of AI infrastructure at any scale, simplifying complex tasks with its user-friendly interface and automated features.


Day 0: Designing your AI data center with Juniper Networks

Event: AI Infrastructure Field Day 2

Appearance: Juniper Networks Presents at AI Infrastructure Field Day 2

Company: Juniper Networks

Video Links:

Personnel: Kyle Baxter

Juniper Networks' presentation at AI Infrastructure Field Day focuses on designing AI data centers with Apstra, emphasizing rail-optimized designs, Apstra's native modeling for them, and its ability to produce a fully functional network architecture in minutes. Kyle Baxter, Head of Apstra Product Management at Juniper Networks, demonstrates how Apstra simplifies the complex process of deploying AI/ML fabrics by providing expert assistance.

The presentation emphasizes the speed and efficiency that Apstra brings to AI data center design. Baxter showcases how users can input basic architectural requirements, such as the number of GPUs and servers. Apstra will then generate design options, including cabling maps, in a short amount of time. This capability allows users to visualize and validate their designs before ordering hardware, reducing potential errors and streamlining the deployment process. The system enables users to build templates for different design types, such as collapsed rail or L3 Clos, making it easy to reuse designs as needed.
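
To make the "enter a few requirements, get a design" idea concrete, the back-of-the-envelope arithmetic behind sizing a rail-optimized fabric looks roughly like the sketch below. The GPU-per-server and port-count figures are illustrative assumptions, not Apstra's internal logic.

```python
import math

# Illustrative design inputs (assumptions, not Apstra defaults).
gpus = 1024
gpus_per_server = 8        # one NIC per GPU, one rail per GPU position
leaf_ports_down = 32       # leaf ports facing servers
leaf_ports_up = 32         # leaf ports facing spines
spine_ports = 64           # ports per spine switch

servers = math.ceil(gpus / gpus_per_server)
rails = gpus_per_server                        # rail-optimized: one leaf group per GPU index
leaves_per_rail = math.ceil(servers / leaf_ports_down)
total_leaves = rails * leaves_per_rail
spines = math.ceil(total_leaves * leaf_ports_up / spine_ports)

print(f"servers={servers} rails={rails} leaves={total_leaves} spines={spines}")
```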

Furthermore, the presentation highlights Apstra’s flexibility and comprehensive approach to network management. Beyond designing the core AI/ML fabric, the platform can manage various network types within a single instance, including storage and front-end networks. Apstra facilitates the generation of configuration data for multiple vendors, ensuring that configurations are correct, and offering validation and continuous monitoring to maintain the integrity of the network. With features like REST-based APIs, Python SDKs, Terraform providers, and Ansible support, users can customize and integrate Apstra with existing workflows, making it a powerful tool for designing and managing AI infrastructure.


GPYOU: Building and Operating your AI Infrastructure with Juniper Networks

Event: AI Infrastructure Field Day 2

Appearance: Juniper Networks Presents at AI Infrastructure Field Day 2

Company: Juniper Networks

Video Links:

Personnel: Kyle Baxter

AI infrastructure is a critical but complex domain, and IT organizations face pressure to deliver results quickly. Juniper Networks positions Juniper Apstra as a solution to streamline the management of AI data centers, backed by proven designs. Kyle Baxter emphasizes the necessity of a robust network foundation for AI and ML workloads and highlights the challenges of traditional network management tools, which often overwhelm users with data, making it difficult to pinpoint root causes and resolve issues efficiently.

Juniper addresses these challenges by offering a comprehensive solution built on the Apstra platform. This platform features a contextual graph database, intent-based networking, and a vendor-agnostic design approach. Combined with Mist AI and the Marvis Virtual Network Assistant, Juniper aims to provide a holistic view of the data center, moving away from managing individual switches to focusing on delivering desired outcomes. This approach simplifies the complex network, allowing for precise identification of root causes, related symptoms, and impacted applications or training jobs.

The presentation focuses on managing the training side of AI and ML clusters. It highlights Apstra’s global capabilities to manage various data center networks, including back-end, storage, and inference networks, for large enterprises. Juniper offers designs and flexibility to manage any network design using a single tool. The key takeaways are the ability to design, deploy, and assure network operations, utilizing Juniper’s leading switching portfolio and security solutions. This aims to provide a streamlined, efficient, and reliable AI infrastructure management solution.


Securing AI Clusters, Juniper’s Approach to Threat Protection with Juniper Networks

Event: AI Infrastructure Field Day 2

Appearance: Juniper Networks Presents at AI Infrastructure Field Day 2

Company: Juniper Networks

Video Links:

Personnel: Kedar Dhuru

AI clusters are high-value targets for cyber threats, requiring a defense-in-depth strategy to safeguard data, workloads, and infrastructure. Kedar Dhuru highlighted how Juniper’s security portfolio provides end-to-end protection for AI clusters, including secure multitenant environments, without compromising performance. The presentation addressed the challenges of securing AI data centers, focusing on securing WAN and data center interconnect links and preventing data loss, which are amplified by the increased scale, performance, and multi-tenancy requirements of these environments.

Juniper’s approach to securing AI data centers involves several key use cases. These include protecting traffic between data centers, preventing DDoS attacks to maintain uptime, securing north-south traffic at the data center edge with application and network-level threat protection, and implementing segmentation within the data center to prevent lateral movement of threats. These security measures can be applied to traditional data centers and public cloud environments with the same functionalities and adapted for cloud-specific architectures. Juniper focuses on high-speed IPsec connections to ensure data encryption without creating performance bottlenecks.

Juniper uses threat detection capabilities to identify indicators of compromise, including inspecting downloaded software and models for tampering and detecting malicious communication from compromised models. The solution employs multiple firewalls, machine learning algorithms, and threat feeds to detect and block malicious activity, along with domain generation algorithm (DGA) detection and DNS security to protect against threats. The presentation also highlighted Juniper's new one-rack-unit firewall with high network throughput and MACsec capabilities, along with multi-tenancy and scale-out firewall architecture features.
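
As a generic illustration of the kind of signal DGA detection relies on (not a description of Juniper's implementation), algorithmically generated domains tend to show higher character entropy than human-chosen names, which a simple heuristic can flag:

```python
import math
from collections import Counter

def shannon_entropy(label: str) -> float:
    """Bits of entropy per character in a domain label."""
    counts = Counter(label)
    total = len(label)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Human-chosen vs. machine-generated-looking labels (illustrative only;
# real detectors combine many features, not just entropy).
for domain in ["juniper", "mail", "xk2q9vjw7rzp1b", "q8fz3kd0wy5hnm"]:
    flag = "suspicious" if shannon_entropy(domain) > 3.5 else "ok"
    print(f"{domain:>16}  entropy={shannon_entropy(domain):.2f}  {flag}")
```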


Maximize AI Cluster Performance using Juniper Self-Optimizing Ethernet with Juniper Networks

Event: AI Infrastructure Field Day 2

Appearance: Juniper Networks Presents at AI Infrastructure Field Day 2

Company: Juniper Networks

Video Links:

Personnel: Vikram Singh

Vikram Singh, Sr. Product Manager, AI Data Center Solutions at Juniper Networks, discussed maximizing AI cluster performance using Juniper’s self-optimizing Ethernet fabric. As AI workloads scale, high GPU utilization and minimized congestion are critical to maximizing performance and ROI. Juniper’s advanced load balancing innovations deliver a self-optimizing Ethernet fabric that dynamically adapts to congestion and keeps AI clusters running at peak efficiency.

The presentation addressed the unique challenges posed by AI/ML traffic, which is primarily UDP-based with low entropy and bursty flows, combined with the synchronous nature of data-parallel compute, where GPUs must synchronize gradients after each iteration. This synchronization makes job completion time a key metric, since a delay in a single flow can idle many GPUs. Traditional Ethernet load balancing, designed around high-entropy TCP traffic and in-order delivery, doesn't handle this type of traffic efficiently, leading to congestion and performance degradation. Alternatives such as packet spraying with specialized NICs or distributed scheduled fabrics are expensive and proprietary.
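
A quick way to see why low-entropy RoCE traffic defeats static ECMP: with only a handful of 5-tuples, a hash-based path choice can concentrate most of the bytes on a few uplinks. The toy sketch below illustrates the effect; it does not model any particular switch ASIC.

```python
import hashlib
from collections import Counter

UPLINKS = 4

def ecmp_pick(src, dst, sport, dport, proto="UDP"):
    """Toy static-ECMP decision: hash of the 5-tuple modulo the link count."""
    key = f"{src}-{dst}-{sport}-{dport}-{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % UPLINKS

# A few elephant RoCE flows: same endpoints, same UDP port 4791, few queue pairs.
flows = [("10.0.0.1", "10.0.1.1", 49152 + qp, 4791) for qp in range(4)]

link_load = Counter(ecmp_pick(*f) for f in flows)
print("flows per uplink:", dict(link_load))   # often badly skewed with so few flows
```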

Juniper offers an open, standards-based approach over Ethernet, called AI load balancing, which includes dynamic load balancing (DLB) that enhances static ECMP by tracking link utilization and buffer pressure at microsecond granularity to make informed forwarding decisions. DLB operates in flowlet mode (breaking flows into subflows based on configurable pauses) or packet mode (packet spraying). Global Load Balancing (GLB) enhances DLB by exchanging link quality data between leaves and spines, enabling leaves to make more informed decisions and avoid congested paths. Juniper's RDMA-aware load balancing (RLB) uses deterministic routing by assigning IP addresses to subflows, eliminating randomness and ensuring consistently high performance and in-order delivery, even on non-rail-optimized designs, without expensive hardware.
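
Conceptually, flowlet-mode DLB behaves like the simplified sketch below: packets that arrive back-to-back stay pinned to one uplink, but once an idle gap exceeds a threshold the next flowlet may be re-steered to the currently least-loaded link. The gap threshold and load metric here are assumptions for illustration, not Juniper's implementation.

```python
FLOWLET_GAP_US = 50          # idle gap that opens a new flowlet (assumed value)

class FlowletBalancer:
    def __init__(self, num_links):
        self.link_bytes = [0] * num_links   # stand-in for utilization/buffer pressure
        self.last_seen = {}                 # flow -> (last packet time, assigned link)

    def forward(self, flow, now_us, size):
        last = self.last_seen.get(flow)
        if last is None or now_us - last[0] > FLOWLET_GAP_US:
            # New flowlet: pick the least-loaded uplink right now.
            link = min(range(len(self.link_bytes)), key=self.link_bytes.__getitem__)
        else:
            link = last[1]                  # same flowlet: keep packets in order
        self.last_seen[flow] = (now_us, link)
        self.link_bytes[link] += size
        return link

lb = FlowletBalancer(num_links=4)
print(lb.forward("qp1", now_us=0, size=4096))
print(lb.forward("qp1", now_us=10, size=4096))    # same flowlet, same link
print(lb.forward("qp1", now_us=200, size=4096))   # gap > 50 us: may move links
```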


AI Unbound, Your Data Center Your Way with Juniper Networks

Event: AI Infrastructure Field Day 2

Appearance: Juniper Networks Presents at AI Infrastructure Field Day 2

Company: Juniper Networks

Video Links:

Personnel: Praful Lalchandani

Praful Lalchandani, VP of Product, Data Center Platforms and AI Solutions at Juniper Networks, opened the presentation by highlighting the rapid growth of the AI data center space and its unique challenges. He noted that Juniper Networks, with its 25 years of experience in networking and security, is uniquely positioned to address these challenges and help customers meet the demands of AI. Juniper is experiencing maximum momentum in the data center space, with revenues exceeding $1 billion in networking alone in 2024. The presentation then dove into the increasing distribution of AI workloads across hyperscalers, Neo Cloud providers, private clouds, and the edge, emphasizing Juniper’s comprehensive portfolio of solutions spanning data center fabrics, interconnectivity, and security.

Lalchandani focused on the critical role of networking in the AI lifecycle, particularly for training and inference. High bandwidth, low latency, and congestion-free networking are essential to optimizing job completion time for training and throughput and minimizing latency for inferencing. The discussion highlighted Juniper’s innovations in this space, including developing AI load balancing capabilities such as Dynamic Load Balancing, Global Load Balancing, and RDMA-aware load balancing. Juniper was the first vendor in the industry to ship a 64-port 800-gig switch, showcasing Juniper’s commitment to providing the bandwidth needed for AI workloads and achieving a leading 800-gig market share.

The presentation also emphasized AI clusters' operational complexity and security challenges. Juniper's Apstra solution offers lifecycle management, from design to deployment to assurance, providing end-to-end congestion visibility and automated remediation recommendations. Security is paramount, and Juniper advocates a defense-in-depth approach with its SRX portfolio, protecting the edge, providing east-west security within the fabric, encrypting data center interconnects, and securing public cloud applications. The presentation concluded by addressing the dilemma customers face between open, best-of-breed technologies and proprietary, tightly coupled ecosystems; Juniper's validated designs, developed in its AI labs, are intended to show customers that they don't have to make that trade-off and can get the best of both worlds.


Secure and optimize AI and ML workloads with the Cross-Cloud Network with Google Cloud

Event: AI Infrastructure Field Day 2

Appearance: Google Cloud Presents at AI Infrastructure Field Day 2 – Afternoon

Company: Google Cloud

Video Links:

Personnel: Vaibhav Katkade

Vaibhav Katkade, a Product Manager at Google Cloud Networking, presented on infrastructure enhancements in cloud networking for secure, optimized AI/ML workloads, focusing on the AI/ML lifecycle of training, fine-tuning, and serving/inference and the corresponding network imperatives for each stage. Data ingestion relies on fast, secure connectivity to on-premises environments via Interconnect and Cross-Cloud Interconnect, facilitating high-speed data transfer. GKE clusters now support up to 65,000 nodes, a significant increase in scale that enables the training of large models like Gemini. Improvements to cloud load balancing enhance performance, particularly for LLM workloads.

A key component discussed was the GKE Inference Gateway, which optimizes LLM serving. It leverages inference metrics from model servers like vLLM, Triton, Dynamo, and Google's JetStream to perform load balancing based on KV cache utilization, improving performance. The gateway also supports autoscaling based on model server metrics, dynamically adjusting compute allocation based on request load and GPU utilization. Additionally, it enables multiplexing and loading multiple model use cases on a single base model using LoRA fine-tuned adapters, increasing model serving density and making efficient use of accelerators. The gateway supports multi-region capacity chasing and integration with security tools like Google's Model Armor, Palo Alto Networks, and NVIDIA's NeMo Guardrails.
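
The routing idea behind the inference gateway can be sketched as follows: rather than round-robin, pick the model-server replica whose reported KV cache utilization and queue depth leave the most headroom. This is an illustrative simplification with assumed metric names and thresholds, not the gateway's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    kv_cache_util: float   # fraction of KV cache blocks in use (reported by the model server)
    queue_depth: int       # requests waiting (reported by the model server)

def pick_replica(replicas, queue_limit=8):
    """Prefer replicas that are not queue-saturated, then the lowest KV cache pressure."""
    eligible = [r for r in replicas if r.queue_depth < queue_limit] or replicas
    return min(eligible, key=lambda r: (r.kv_cache_util, r.queue_depth))

pool = [
    Replica("vllm-0", kv_cache_util=0.92, queue_depth=3),
    Replica("vllm-1", kv_cache_util=0.41, queue_depth=1),
    Replica("vllm-2", kv_cache_util=0.55, queue_depth=9),   # saturated queue
]
print(pick_replica(pool).name)   # -> vllm-1
```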

Katkade also covered key considerations for running inference at scale on Kubernetes. One significant challenge addressed is the constrained availability of GPU/TPU capacity across regions. Google’s solution allows routing to regions with available capacity through a single inference gateway, streamlining operations and improving capacity utilization. Platform and infrastructure teams gain centralized control and consistent baseline coverage across all models by integrating AI security tools directly at the gateway level. Further discussion included load balancing optimization based on KV cache utilization, achieving up to 60% lower latency and 40% higher throughput. The gateway supports model name-based routing and prioritization, compliant with the OpenAI API spec, and allows for different autoscaling thresholds for production and development workloads.


Cloud WAN Connecting networks for the AI Era with Google Cloud

Event: AI Infrastructure Field Day 2

Appearance: Google Cloud Presents at AI Infrastructure Field Day 2 – Afternoon

Company: Google Cloud

Video Links:

Personnel: Aniruddha Agharkar

This presentation by Aniruddha Agharkar, Product Manager at Google Cloud Networking, centers on Cloud WAN, Google’s fully managed backbone solution designed for the enterprise era and powered by Google’s planet-scale network. Customers have historically relied on bespoke networks using leased lines and MPLS providers, leading to inconsistencies in security, operational challenges, a lack of visibility, and slow adoption of cloud technologies. Cloud WAN addresses these issues by providing a consistent network underpinned by Google’s high-performance, reliable infrastructure that supports billions of users daily, including services like YouTube and Google Maps.

Cloud WAN offers any-to-any connectivity, serving as a centerpiece for connecting branch offices, data centers, and other cloud providers, with the potential to reduce total cost of ownership by up to 40% and improve application performance by a similar margin. Key use cases include enabling high-performance connectivity between various network locations and consolidating branch offices into the Google Cloud ecosystem. To facilitate this, Google is introducing cross-site interconnect, providing Layer 2 connectivity across different sites, backed by the same infrastructure as cloud interconnects, decoupling cloud infrastructure from site-to-site connectivity.

The presentation also highlights the importance of SD-WAN integration, allowing customers to bring their SD-WAN head-ends into the cloud and leverage premium tier networking for faster and more reliable connectivity. Google’s Verified Peering Program (VPP) identifies and badges validated ISPs to ensure optimal last-mile connectivity into Google Cloud. Success stories from customers like Nestle demonstrate the benefits of Cloud WAN, including enhanced stability, reliability, cost efficiency, and improved latency. Cloud WAN and associated services like cross-site interconnects, SD-WAN appliances, and NCC Gateway are accessible through the same Google Cloud console, streamlining management and offering a unified networking solution.


AI Hypercomputer and TPU (Tensor) acceleration with Google Cloud

Event: AI Infrastructure Field Day 2

Appearance: Google Cloud Presents at AI Infrastructure Field Day 2 – Afternoon

Company: Google Cloud

Video Links:

Personnel: Rose Zhu

Rose Zhu, a Product Manager at Google Cloud TPU, presented on TPUs for large-scale training and inference, emphasizing the rapid growth of AI models and the corresponding demands for compute, memory, and networking. Zhu highlighted the specialization of Google’s TPU chips and systems, purpose-built ASICs for machine learning applications, coupled with innovations in power efficiency, networking using Jupiter optical networks and ICI, and liquid cooling. A key focus was on co-designing TPUs with software, enabling them to function as a supercomputer, supported by frameworks like JAX and PyTorch, and a low-level compiler (XLA) to maximize performance.

Zhu showcased real-world TPU usage, from powering Google's internal applications like Gmail and YouTube to serving external cloud customers across segments such as Anthropic, Salesforce, Mercedes, and Kakao. The adoption of Cloud TPUs has seen significant growth, with an eightfold increase in chip-hour consumption within 12 months. A major announcement was the upcoming 7th-generation TPU, Ironwood, slated for general availability in Q4 2025, featuring two configurations, TPU7 and TPU7X, to address diverse data center requirements and customer needs for locality and low latency.

Zhu detailed the specifications of Ironwood, including its BF16 and FP8 support, teraflops performance, and high-bandwidth memory. Ironwood boasts significant performance and power efficiency improvements compared to previous TPU generations. Zhu also touched on optimizing TPU performance through techniques like flash attention, host DRAM offload, mixed precision training, and an inference stack for TPU. GKE handles TPU orchestration, with a focus on scheduling goodput and runtime goodput, and Zhu highlighted GKE's capabilities in managing large-scale training and inference, emphasizing improvements in scheduling and runtime efficiency.
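
As a small, generic illustration of the mixed-precision technique mentioned (keep master weights in float32, run the expensive matmuls in bfloat16, accumulate in float32), a JAX sketch might look like the following; it is not Google's training stack.

```python
import jax
import jax.numpy as jnp

@jax.jit
def mixed_precision_layer(params_f32, x_f32):
    # Cast weights and activations to bfloat16 for the expensive matmul ...
    w = params_f32.astype(jnp.bfloat16)
    x = x_f32.astype(jnp.bfloat16)
    # ... but ask for float32 accumulation to preserve numerics.
    return jnp.dot(x, w, preferred_element_type=jnp.float32)

key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (1024, 1024), dtype=jnp.float32)  # master weights stay float32
x = jax.random.normal(key, (8, 1024), dtype=jnp.float32)
print(mixed_precision_layer(w, x).dtype)   # float32
```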


AI hypercomputer and GPU acceleration with Google Cloud

Event: AI Infrastructure Field Day 2

Appearance: Google Cloud Presents at AI Infrastructure Field Day 2 – Afternoon

Company: Google Cloud

Video Links:

Personnel: Dennis Lu

Dennis Lu, a Product Manager at Google Cloud specializing in GPUs, presented on AI Hypercomputer and GPU acceleration with Google Cloud. Lu covered Google Cloud's AI Hypercomputer from consumption models to purpose-built hardware, with particular focus on Google's Cluster Director for managing GPU fleets.

Dennis then moved to the hardware side of Google Cloud's AI infrastructure, discussing current and upcoming GPU systems. Available systems include A3 Ultra (H200 GPUs), A4 (B200 GPUs), and A4X (GB200 systems), built on RoCE over ConnectX-7 NICs. Also discussed were two systems coming in 2025, based on the NVIDIA RTX Pro 6000 and GB300, offering advancements in memory and networking.

The presentation also featured performance projections for LLM training, with A4 offering approximately 2x the performance of H100s. The A4 was described as a Goldilocks solution due to its balance of price and performance. There was also discussion on whether Hopper-generation GPUs would decrease in price because of newer generations of hardware.


Analytics Storage and AI, Data Prep and Data Lakes with Google Cloud

Event: AI Infrastructure Field Day 2

Appearance: Google Cloud Presents at AI Infrastructure Field Day 2 – Afternoon

Company: Google Cloud

Video Links:

Personnel: Vivek Saraswat

Vivek Saraswat, Group Product Manager at Google Cloud Storage, presented on analytics storage and AI, focusing on data preparation and data lakes. He emphasized the close ties between analytics and AI workloads, highlighting key innovations built to address related challenges. The presentation demonstrates that analytics play a crucial role in the AI data pipeline, particularly in ingestion, data preparation, and cleaning.

Saraswat explained how customers increasingly build unified data lakehouses using open metadata table formats like Apache Iceberg. This approach enables both analytics and AI workloads, including running analytics on AI data. He cited Snap as a customer example, processing trillions of user events weekly using Spark for data preparation and cleaning on top of Google Cloud Storage. Google Cloud Storage offers optimizations like the Cloud Storage Connector, Anywhere Cache, and Hierarchical Namespace (HNS) to enhance data preparation.
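
A minimal version of the data-preparation pattern described, with Spark reading event data directly from Cloud Storage, cleaning it, and writing it back for downstream training, might look like the PySpark sketch below. The bucket paths and column names are hypothetical, and the cluster is assumed to have the Cloud Storage connector configured.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("event-prep").getOrCreate()

# Read raw events directly from a GCS bucket (path is illustrative).
events = spark.read.json("gs://example-raw-events/2025/06/*.json")

# Typical cleaning steps: drop malformed rows, deduplicate, keep needed columns.
clean = (
    events
    .where(F.col("user_id").isNotNull())
    .dropDuplicates(["event_id"])
    .select("user_id", "event_type", "timestamp", "payload")
)

# Write back to Cloud Storage in a training-friendly columnar format.
clean.write.mode("overwrite").parquet("gs://example-curated-events/2025/06/")
```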

Saraswat covered the concept of a data lakehouse, combining structured and unstructured data in a unified platform with a separation layer using open table formats. Examples from Snowflake, Databricks, Uber, and Google Cloud's BigQuery tables for Apache Iceberg illustrated the diverse architectures employed. Saraswat also addressed common customer challenges like data fragmentation, performance bottlenecks, and optimization for resilience, security, and cost, offering solutions like Storage Intelligence, Anywhere Cache, and Bucket Relocate, and referencing customer case studies such as Spotify and Two Sigma.


The latest in high-performance storage, Rapid on Colossus with Google Cloud

Event: AI Infrastructure Field Day 2

Appearance: Google Cloud Presents at AI Infrastructure Field Day 2 – Afternoon

Company: Google Cloud

Video Links:

Personnel: Michal Szymaniak

Michal Szymaniak, Principal Engineer at Google Cloud, presented on Rapid Storage, a new zonal storage product within the cloud storage portfolio, powered by Google’s foundational distributed file system, Colossus. The goal in designing Rapid Storage was to create a storage system that offers the low latency of block storage, the high throughput of parallel file systems, and the ease of use and scale of object storage, while also being easy to start and manage. Rapid Storage addresses customers’ limitations when working with GCS, where they often tier to persistent disk for lower latency or seek parallel file system functionality.

Rapid Storage is fronted by Cloud Storage Fuse, aiming to provide a parallel-file-system-like experience with up to 20 million requests per second and six terabytes per second of throughput on the data side, while maintaining cloud storage benefits like 11 nines of durability within a zone and three nines of availability. Sub-millisecond latency for random reads and appends, achieved after opening a file through the Fuse interface, makes it significantly faster than comparable offerings from other hyperscalers. A separate data stack, focused squarely on performance, enables this predictability, whereas standard GCS emphasizes manageability and scalability.

Rapid Storage, available as a new storage class in zonal locations, leverages Colossus in a particular cloud zone, bringing it to the forefront of cloud storage. By accessing cloud storage directly from the front end, only one RPC hop away from data on physical hard drives, Rapid Storage minimizes latency and physical hops. Integrating gRPC, a stateful streaming protocol, and hierarchical namespace enhances file-system-friendly operations and performance. Rapid Storage is currently in private preview, with allow-list GA targeted for later this year, and users are encouraged to sign up and provide feedback.


Intro to Managed Lustre with Google Cloud

Event: AI Infrastructure Field Day 2

Appearance: Google Cloud Presents at AI Infrastructure Field Day 2 – Afternoon

Company: Google Cloud

Video Links:

Personnel: Dan Eawaz

Dan Eawaz, Senior Product Manager at Google Cloud, introduced Managed Lustre with Google Cloud, a fully managed parallel file system built on DDN Exascaler. The aim is to solve the demanding requirements of data preparation, model training, and inference in AI workloads. Managed Lustre provides high throughput to keep GPUs and TPUs fully utilized and enables quick writing and reading for checkpoints.

Currently, many customers leverage parallel file systems (PFSs) like Lustre on-prem. Google Cloud Managed Lustre makes it easier for customers to bring their workloads to the cloud without re-architecting. It optimizes TCO by maximizing the utilization of expensive GPUs and TPUs. The offering is a persistent service deployed co-located with compute for optimal latency, scaling from 18 terabytes to petabyte scale, with sub-millisecond latency and an initial throughput of one terabyte per second.

The service is managed: customers specify their region, capacity, and throughput needs, and Google deploys the capacity in the background, providing a mount point for easy integration with GCE or GKE. The Google Cloud Managed Lustre service has a 99.9% availability SLA in a single zone and is fully POSIX compliant. The service integrates with GKE via a CSI driver and supports Slurm through the Cluster Toolkit. It also has a built-in integration for batch data transfer to and from Google Cloud Storage.
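
From an application's point of view, the mount point makes checkpointing ordinary file I/O. The hedged sketch below assumes a hypothetical mount path; it simply writes a checkpoint and publishes it with a rename, leaving throughput to the parallel file system.

```python
import os
import pickle
import time

LUSTRE_MOUNT = "/mnt/managed-lustre"   # hypothetical mount point exposed to the VM or pod

def save_checkpoint(step, state):
    """Write a checkpoint, then atomically publish it so readers never see a partial file."""
    path = os.path.join(LUSTRE_MOUNT, f"ckpt-{step:07d}.pkl")
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.rename(tmp, path)               # atomic rename within the same file system
    return path

state = {"step": 1000, "weights": b"\x00" * (64 << 20)}   # 64 MB dummy payload
t0 = time.time()
print(save_checkpoint(1000, state), f"{time.time() - t0:.2f}s")
```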


Overview of Cloud Storage for AI, Lustre, GCSFuse, and Anywhere Cache with Google Cloud

Event: AI Infrastructure Field Day 2

Appearance: Google Cloud Presents at AI Infrastructure Field Day 2 – Morning

Company: Google Cloud

Video Links:

Personnel: Marco Abela

Marco Abela, Product Manager at Google Cloud Storage, presented an overview of Google Cloud's storage solutions optimized for AI/ML workloads. The presentation addressed the critical role of storage in AI pipelines, emphasizing that an inadequate storage solution can significantly bottleneck GPU utilization, causing idle GPUs and hindering data processing from initial data preparation to model serving. He highlighted two storage types optimized for these workloads: object storage (Cloud Storage) for persistent, high-throughput storage with virtually unlimited capacity, and parallel file systems (Managed Lustre) for ultra-low latency, catering to specific workload profiles. The typical storage requirements for AI/ML involve vast capacity, high aggregate throughput, millions of requests per second (QPS/IOPS), and low-latency reads, with performance priorities varying across training profiles.

The presentation further detailed Cloud Storage Fuse, a solution enabling a bucket to be mounted as a local file system. Abela highlighted Google's heavy investment in it and the significant payoff, since it addresses the need for file system semantics without rewriting applications for object storage. Cloud Storage Fuse now serves as a high-performance client with features like file cache, parallel download, streaming writes, and Hierarchical Namespace bucket integration. The file cache improves training times, while the parallel download feature drastically speeds up model loading, achieving up to 9x faster load times than fsspec. Hierarchical namespace buckets offer atomic folder renames for checkpointing, resulting in 30x faster performance.
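
The checkpointing pattern that benefits from hierarchical namespace buckets is the classic write-to-temp-then-rename publish: through a Cloud Storage Fuse mount of an HNS bucket, the rename is a single folder operation rather than a per-object copy. The mount path below is hypothetical.

```python
import os

MOUNT = "/gcs/training-bucket"          # hypothetical gcsfuse mount of an HNS bucket

def publish_checkpoint(step: int) -> str:
    tmp_dir = os.path.join(MOUNT, f"ckpt-{step}.tmp")
    final_dir = os.path.join(MOUNT, f"ckpt-{step}")
    os.makedirs(tmp_dir, exist_ok=True)
    for shard in range(4):              # write shards into the temporary folder
        with open(os.path.join(tmp_dir, f"shard-{shard}.bin"), "wb") as f:
            f.write(os.urandom(1024))
    # On an HNS bucket this folder rename maps to a single atomic operation,
    # so readers never see a half-written checkpoint; on a flat bucket it
    # would fall back to object-by-object copies.
    os.rename(tmp_dir, final_dir)
    return final_dir

print(publish_checkpoint(42))
```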

Abela then introduced Anywhere Cache, a newly GA feature designed to improve performance by co-locating storage on SSD in the same zone as compute. This “turbo button” for Cloud Storage simplifies usage, requiring no code refactoring while reducing time to first byte latency by up to 70% for regional buckets and 96% for multi-regional buckets. A GenAI customer case study demonstrated its effectiveness in model loading, achieving a 99% cache hit rate, eliminating tail latencies, and reducing network egress costs using multi-regional buckets. The presentation also detailed a recommender tool that helps users understand the cacheability of their workload, optimal configuration, throughput, and potential cost savings.


Google Kubernetes Engine and AI Hypercomputer with Google Cloud

Event: AI Infrastructure Field Day 2

Appearance: Google Cloud Presents at AI Infrastructure Field Day 2 – Morning

Company: Google Cloud

Video Links:

Personnel: Ishan Sharma

Ishan Sharma, Group Product Manager on the Google Kubernetes Engine team, presented on GKE and AI Hypercomputer, focusing on industry-leading infrastructure, training quickly at mega scale, serving with lower cost and latency, economical access to GPUs and TPUs, and faster time to value. He emphasized that Google Cloud is committed to ensuring new accelerators are available on GKE on day one. The AI Hypercomputer, an entire stack and reference architecture, is the same stack that Google uses internally for Vertex AI.

The presentation highlighted Cluster Director for GKE, which enables the deployment, scaling, and management of AI-optimized GKE clusters where physically co-located accelerators function as a single unit, delivering high performance and ultra-low latency. Key benefits include running densely co-located accelerators, mega-scale training jobs, topology-aware scheduling, ease of use, 360-degree observability, and resiliency. Cluster Director for GKE uses standard Kubernetes APIs and the existing ecosystem, which allows users to orchestrate these capabilities.

Sharma also demonstrated the GKE Inference Gateway, which improves LLM inference responses by routing requests based on model server metrics like KV cache utilization and queue length, reducing variability and improving time-to-first-token latency. Additionally, he showcased GKE Inference Quickstart, a feature on the GKE homepage within the Google Cloud console, which recommends optimized infrastructure configurations for different models, such as the NVIDIA L4 for the Gemma 2 2B instruction-tuned model. This simplifies model deployment and optimizes performance.


AI Hypercomputer Cluster Toolkit with Google Cloud

Event: AI Infrastructure Field Day 2

Appearance: Google Cloud Presents at AI Infrastructure Field Day 2 – Morning

Company: Google Cloud

Video Links:

Personnel: Ilias Katsardis

Ilias Katsardis, Senior Product Manager for AI infrastructure at Google Cloud, presented on the AI Hypercomputer Cluster Toolkit, addressing the complexities of deploying AI infrastructure on Google Cloud’s compute engine and GKE. He highlighted the challenges customers face when trying to quickly and efficiently create supercomputers in the cloud, including performance uncertainty, troubleshooting difficulties, and potential downtime. These issues often lead to increased time-to-market and costs, which Google Cloud aims to mitigate.

To tackle these problems, Google Cloud developed Cluster Director, a foundation built upon purpose-built hardware, VMs, Managed Instance Groups, Kubernetes, and GKE. Cluster Director includes capabilities such as a placement policy to ensure VMs are located in the same rack and switch for optimal performance. Sitting within Cluster Director is the Cluster Toolkit, which Katsardis described as the orchestrator for AI and HPC environments. It uses Terraform scripts and APIs to combine everything into a single deployment. Customers define their AI infrastructure or HPC cluster in a blueprint, a concise configuration file that Cluster Toolkit uses to provision the environment, as sketched below.
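
The general shape of such a blueprint, expressed here as a Python dictionary dumped to YAML for illustration, looks roughly like the following; module sources, field names, and values are assumptions, and the toolkit's own examples should be treated as authoritative.

```python
import yaml

# Illustrative blueprint structure: a name, shared variables, and deployment
# groups made of modules the toolkit turns into Terraform under the hood.
blueprint = {
    "blueprint_name": "a3-training-cluster",
    "vars": {
        "project_id": "my-project",        # hypothetical project
        "deployment_name": "ai-fd2-demo",
        "region": "us-central1",
        "zone": "us-central1-a",
    },
    "deployment_groups": [
        {
            "group": "primary",
            "modules": [
                {"id": "network", "source": "modules/network/vpc"},
                {"id": "gpu_nodes",
                 "source": "modules/compute/vm-instance",
                 "use": ["network"],
                 "settings": {"machine_type": "a3-highgpu-8g",
                              "instance_count": 4}},
            ],
        }
    ],
}

print(yaml.safe_dump(blueprint, sort_keys=False))
```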

The presentation introduced the Cluster Toolkit to simplify the deployment and management of AI infrastructure on Google Cloud, addressing the need for turnkey environments that adhere to best practices. While the underlying infrastructure relies on Terraform, the speaker emphasized that customers interact with a simplified blueprint, enabling easier auditing and faster deployment. The discussion also touched on future directions, including user interfaces to further streamline the process and the potential for managed services.


Storage Intelligence with Google Cloud

Event: AI Infrastructure Field Day 2

Appearance: Google Cloud Presents at AI Infrastructure Field Day 2 – Morning

Company: Google Cloud

Video Links:

Personnel: Manjul Sahay

Manjul Sahay, Group Product Manager at Google Cloud Storage, presented on Storage Intelligence with Google Cloud, focusing on helping customers, both enterprises and startups, manage their storage effectively for AI applications. These customers often face challenges in managing storage at scale for security, cost, and operational efficiency, particularly with small and new teams. A key problem is the exponential growth of storage due to the influx of new data and AI-generated content, making object storage management increasingly complex.

The presentation highlighted the difficulties in managing object storage, which often involves billions or trillions of objects spread across multiple buckets and projects. Many customers resort to building custom management tools, leading to significant expenditure on management, sometimes 13% to 24% of the total spend. Google Cloud aims to address this by providing storage intelligence to help customers manage storage, meet them in multi-cloud environments, and reduce or eliminate storage management challenges.

Google Cloud introduced Storage Intelligence to address these storage management challenges, offering a unified platform with features like datasets, which provide a metadata repository in BigQuery. It also offers the ability to identify public objects, batch operations that act on billions of objects, and features like bucket migration, all designed to streamline storage management. With a free trial on offer, early adopters like Anthropic and Spotify are already seeing benefits in managing and optimizing their storage infrastructure.
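
Because the dataset lands in BigQuery, ordinary SQL can answer management questions across billions of objects. The hedged sketch below assumes a hypothetical table and column names; the real schema comes from the Storage Intelligence dataset itself.

```python
from google.cloud import bigquery

client = bigquery.Client()

# The table name is a placeholder for wherever your Storage Intelligence
# dataset lands in BigQuery; column names are illustrative assumptions.
QUERY = """
SELECT bucket, name, size, storage_class, updated
FROM `my-project.storage_intelligence.object_metadata`
WHERE size > 1e9
  AND updated < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY)
ORDER BY size DESC
LIMIT 100
"""

# Surface the largest year-old objects as candidates for lifecycle actions.
for row in client.query(QUERY).result():
    print(row.bucket, row.name, row.size)
```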


Introduction to the AI Hypercomputer with Google Cloud

Event: AI Infrastructure Field Day 2

Appearance: Google Cloud Presents at AI Infrastructure Field Day 2 – Morning

Company: Google Cloud

Video Links:

Personnel: Sean Derrington

Sean Derrington, Product Manager, Storage at Google Cloud, introduced the AI Hypercomputer at AI Infrastructure Field Day, highlighting Google Cloud's investments in making it easier for customers to consume and run their AI workloads. The focus is on infrastructure, with consideration given to the consumption model and optimized software. The AI Hypercomputer encompasses optimized software and purpose-built hardware, with storage, compute, and networking at its foundation.

A key announcement was Google Cloud Managed Lustre, a new offering based on Google's partnership with DDN and its EXAScaler technology. Managed Lustre provides a scalable parallel file system ideal for AI and ML workloads, offering petabyte-scale storage, sub-millisecond latency, and high throughput (up to a terabyte per second). Google Cloud also announced Anywhere Cache, allowing users to keep data closer to accelerators and improve the performance of AI workloads; it caches up to a petabyte of capacity within a given zone and delivers high bandwidth, up to 2.5 terabytes per second. Rapid Storage delivers high QPS, up to 20 million QPS per bucket, and throughput of up to 6 terabytes per second for a given bucket.

The presentation also touched on advancements in compute and networking. Google Cloud announced new A4 and A4 Ultra machines within its GPU portfolio in partnership with NVIDIA, and its seventh-generation TPU, Ironwood, which offers significantly higher performance and memory than previous versions and is deployed as a cluster with over 9,200 chips, offering 42.5 exaflops of compute capacity. Additionally, improvements to networking infrastructure with Cloud WAN, a fully managed service that enhances performance by up to 40%, were discussed, along with the GKE Inference Gateway, which improves AI inference through intelligent routing.


Demonstrating Keysight’s AI Fabric Test Methodology

Event: AI Infrastructure Field Day 2

Appearance: Keysight Presents at AI Infrastructure Field Day 2

Company: Keysight Technologies

Video Links:

Personnel: Alex Bortok

This session provides an overview of the Keysight AI fabric test methodology, demonstrating key findings and improvements achieved through automated testing and the search for optimal configuration parameters. Alex Bortok, Lead Product Manager at Keysight Technologies, introduces the Keysight AI fabric test methodology using the KAI Data Center Builder product. The methodology guides users through the phases of designing and building an AI fabric, emphasizing the importance of topology selection, collective operation algorithms, performance isolation, load balancing, and congestion control. The methodology and related white papers are available for download via a QR code or the accompanying link.

The presentation delves into key terminology, including collective operations (broadcast, all-reduce, all-to-all), ranks, collective size, and data size. Metrics such as collective completion time, algorithm bandwidth, and bus bandwidth are defined and used to measure performance. Bortok explains that bus bandwidth is a useful metric because it factors the number of GPUs out of the equation and isolates the limiting factor that determines how long the collective operation will take. A testbed comprising four switches with 800-Gbps ports is described, emulating 16 GPUs/network cards running at 400 Gbps to assess fabric performance.
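
The relationship between these metrics follows the convention used by NCCL-style collective benchmarks: algorithm bandwidth is data size divided by completion time, and bus bandwidth rescales it by a per-collective factor so that the GPU count drops out and the number reflects per-link load. A small worked example, using the conventional scaling factors:

```python
def algorithm_bw(data_bytes: int, completion_time_s: float) -> float:
    """Algorithm bandwidth: how fast the collective moved the payload (GB/s)."""
    return data_bytes / completion_time_s / 1e9

def bus_bw(alg_bw: float, ranks: int, collective: str) -> float:
    """Rescale to bus bandwidth with the usual NCCL-tests factors, so the result
    reflects per-link utilisation independent of the number of GPUs."""
    factor = {
        "all_reduce": 2 * (ranks - 1) / ranks,
        "all_gather": (ranks - 1) / ranks,
        "reduce_scatter": (ranks - 1) / ranks,
        "broadcast": 1.0,
    }[collective]
    return alg_bw * factor

# Example: a 1 GiB all-reduce across 16 emulated 400-Gbps GPUs finishing in 50 ms.
alg = algorithm_bw(1 << 30, 0.050)
print(f"algbw = {alg:.1f} GB/s, busbw = {bus_bw(alg, 16, 'all_reduce'):.1f} GB/s")
```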

A demonstration highlights the impact of congestion control on network performance. By comparing scenarios with and without congestion control enabled, the presentation illustrates how fine-tuning DCQCN parameters can optimize bandwidth utilization and reduce congestion. Bortok uses the tool to test different settings on the fabric and converge on an optimal configuration. The presentation concludes by mentioning Ultra Ethernet Consortium membership and upcoming webinars detailing Keysight's innovations in AI.


Maximizing the Performance of AI Backend Fabric with Keysight

Event: AI Infrastructure Field Day 2

Appearance: Keysight Presents at AI Infrastructure Field Day 2

Company: Keysight Technologies

Video Links:

Personnel: Alex Bortok

This session provides an overview of the Keysight AI (KAI) Data Center Builder solution and how it supports each phase of AI data center design and deployment with actionable data to improve performance and increase the reliability of AI clusters. The presentation explains how KAI Data Center Builder helps streamline the design process, optimizes resource allocation, and enhances the overall efficiency and stability of AI infrastructures to achieve superior performance and reliability. The discussion focuses on the performance of backend fabrics within AI clusters, highlighting Keysight’s expertise in creating solutions for companies that build various products, including microchips, smartphones, routers, 5G towers, and AI data centers.

Alex Bortok, Lead Product Manager for AI Data Center Solutions at Keysight Technologies, introduces the KAI Data Center Builder product, which is designed to test the backend networks GPUs use for data exchange, especially during model training and inference. He distinguishes it from front-end network testing solutions like Keysight CyPerf. The presentation covers the capabilities of the KAI Data Center Builder in emulating AI workloads, benchmarking network infrastructure performance, fine-tuning performance, reproducing production issues, and planning for new designs and topologies. The product emulates the behavior of real GPU servers, reducing the need to purchase them for lab testing.

The presentation also highlights two key KAI Data Center Builder applications: Collective Benchmarks and Workload Emulation. Collective Benchmarks lets users zoom in on a single transaction among thousands of GPUs so that network and system parameters can be fine-tuned. Workload Emulation accounts for the fact that GPUs not only move data but also compute, and that there is a dependency between how much data is computed and how much data is moved. KAI Data Center Builder is part of the larger KAI solution portfolio. The presentation concludes by leading into a KAI Data Center Builder demonstration.