|
This video is part of the appearance, “Google Cloud Presents at AI Infrastructure Field Day 2 – Afternoon”. It was recorded as part of AI Infrastructure Field Day 2, held from 13:00 to 16:30 on April 22, 2025.
Vaibhav Katkade, a Product Manager at Google Cloud Networking, presented infrastructure enhancements in cloud networking for secure, optimized AI/ML workloads. The talk followed the AI/ML lifecycle, encompassing training, fine-tuning, and serving/inference, and the corresponding network imperatives for each stage. Data ingestion relies on fast, secure connectivity to on-premises environments via Cloud Interconnect and Cross-Cloud Interconnect, facilitating high-speed data transfer. GKE clusters now support up to 65,000 nodes, a significant increase in scale that enables the training of large models like Gemini. Improvements to Cloud Load Balancing further enhance performance, particularly for LLM workloads.
A key component discussed was the GKE Inference Gateway, which optimizes LLM serving. It leverages inference metrics from model servers such as vLLM, NVIDIA Triton, NVIDIA Dynamo, and Google’s JetStream to perform load balancing based on KV cache utilization, improving performance. The gateway also supports autoscaling based on model server metrics, dynamically adjusting compute allocation to request load and GPU utilization. Additionally, it enables multiplexing multiple model use cases onto a single base model using LoRA fine-tuned adapters, increasing model-serving density and making more efficient use of accelerators. The gateway supports multi-region capacity chasing and integration with security tools such as Google’s Model Armor, Palo Alto Networks, and NVIDIA’s NeMo Guardrails.
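To make the KV-cache-aware load balancing idea concrete, the sketch below picks a model-server replica based on its reported KV cache utilization. It is purely illustrative: the replica addresses, the 0.8 threshold, and the selection logic are assumptions for this example, not the Inference Gateway’s actual algorithm.

```python
import random

def pick_endpoint(endpoints: dict[str, float], threshold: float = 0.8) -> str:
    """Choose a model-server replica using KV cache utilization as the load signal.

    `endpoints` maps replica address -> current KV cache utilization (0.0-1.0),
    e.g. as scraped from a model server's metrics endpoint. Illustrative only.
    """
    # Prefer replicas with headroom in their KV cache ...
    candidates = [ep for ep, util in endpoints.items() if util < threshold]
    if candidates:
        return random.choice(candidates)
    # ... otherwise fall back to the least-loaded replica.
    return min(endpoints, key=endpoints.get)

# Hypothetical utilization snapshot reported by three replicas.
replicas = {"10.0.0.11:8000": 0.35, "10.0.0.12:8000": 0.92, "10.0.0.13:8000": 0.55}
print(pick_endpoint(replicas))
```

The same utilization signal can also feed autoscaling decisions, which is how the gateway ties routing and capacity together.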
Katkade also covered key considerations for running inference at scale on Kubernetes. One significant challenge addressed is the constrained availability of GPU/TPU capacity across regions. Google’s solution allows routing to regions with available capacity through a single inference gateway, streamlining operations and improving capacity utilization. Platform and infrastructure teams gain centralized control and consistent baseline coverage across all models by integrating AI security tools directly at the gateway level. Further discussion included load balancing optimization based on KV cache utilization, achieving up to 60% lower latency and 40% higher throughput. The gateway supports model name-based routing and prioritization, compliant with the OpenAI API spec, and allows for different autoscaling thresholds for production and development workloads.
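Because the gateway exposes an OpenAI-compatible API and routes on the model name, a client can target a base model or a LoRA adapter simply by changing that field. The snippet below is a hedged illustration using the standard OpenAI Python client; the endpoint URL and model name are hypothetical placeholders, not values from the presentation.

```python
from openai import OpenAI

# Hypothetical gateway endpoint exposing the OpenAI-compatible API surface.
client = OpenAI(
    base_url="https://inference-gateway.example.com/v1",
    api_key="unused-placeholder",
)

# The "model" field selects the serving target; here a LoRA adapter name
# (hypothetical) multiplexed onto a shared base model.
resp = client.chat.completions.create(
    model="gemma-2b-lora-support-bot",
    messages=[{"role": "user", "content": "Summarize my open tickets."}],
)
print(resp.choices[0].message.content)
```

In this model, routing, prioritization, and capacity chasing across regions stay behind the single gateway address, so clients need no awareness of where the accelerators actually are.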
Personnel: Vaibhav Katkade