|
This video is part of the appearance, “cPacket Presents at Networking Field Day 38”. It was recorded as part of Networking Field Day 38 from 8:00-9:30 on July 10, 2025.
Watch on YouTube
Watch on Vimeo
Modern AI workloads rely on high-performance, low-latency GPU clusters, but traditional observability tools fall short in diagnosing issues across these dense, distributed environments. In this session, cPacket explored how they augment GPU and storage telemetry (DCGM/NVML/IOPS) with full-fidelity packet insights. They covered how to correlate job scheduling, retransmissions, queue depth, and tensor-core utilization in real time, and how to establish performance baselines, auto-trigger mitigations, integrate with SRE dashboards, and continuously tune topologies for maximum AI throughput and resource efficiency.

Erik Rudin and Ron Nevo introduced the emerging challenge of AI factories moving into enterprises, contrasting inference workloads with the well-understood elephant flows of AI training in hyperscale data centers. Inference presents unique, less-understood traffic patterns, often driven by user or agent interactions and characterized by varying query-response ratios and KV cache management policies, all of which demand optimal GPU utilization without sacrificing latency.
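The baseline-then-mitigate workflow described above can be sketched as a rolling-baseline anomaly check over a telemetry stream. This is an illustrative sketch only, not cPacket's implementation: the metric (retransmissions per second), window size, and 3-sigma threshold are all assumptions.

```python
from collections import deque
from statistics import mean, stdev

class RollingBaseline:
    """Maintain a rolling baseline for one telemetry metric (here,
    hypothetically, TCP retransmissions per second) and flag samples
    that deviate from it. Window and sigma values are illustrative."""

    def __init__(self, window=30, sigmas=3.0):
        self.samples = deque(maxlen=window)  # recent history only
        self.sigmas = sigmas

    def observe(self, value):
        """Return True if `value` is anomalous vs. the current baseline."""
        anomalous = False
        if len(self.samples) >= 10:  # require enough history first
            mu, sd = mean(self.samples), stdev(self.samples)
            anomalous = abs(value - mu) > self.sigmas * max(sd, 1e-9)
        self.samples.append(value)
        return anomalous

# Hypothetical usage: a steady retransmission rate, then a spike that
# would trigger a mitigation or an SRE-dashboard alert.
baseline = RollingBaseline(window=30)
flags = [baseline.observe(v) for v in [5, 6, 5, 4, 6, 5, 5, 6, 4, 5, 5, 120]]
```

In a real pipeline the same structure would run per flow or per GPU node, with the flagged samples feeding whatever mitigation or alerting hook the operator has wired in.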
The core of cPacket’s solution for AI observability lies in supplementing traditional GPU telemetry with packet-level visibility, particularly on the north-south (front-end) network that connects AI clusters to the rest of the enterprise. This integration is crucial for pinpointing the exact source of latency (whether from the cluster, switch, or storage), identifying microbursts that internal switch telemetry might miss, and understanding session-level characteristics that impact AI workload performance. Unlike traditional network monitoring, which often falls short in these highly dynamic and dense environments, cPacket’s approach aims to provide the granular, real-time data necessary for continuous tuning and optimization of AI infrastructures.
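The microburst point above can be made concrete: a sub-millisecond traffic spike disappears into averaged switch counters but is visible if per-packet timestamps are binned into fine windows. The sketch below is a simplified illustration under assumed parameters (100-microsecond windows, a 100 Gbps link, a 70% utilization threshold); it is not cPacket's algorithm.

```python
def find_microbursts(packets, window_us=100, link_gbps=100.0, threshold=0.7):
    """Bin (timestamp_us, bytes) packet records into fixed windows and
    flag windows whose utilization exceeds `threshold` of line rate.
    Returns sorted (window_start_us, utilization) pairs for burst windows.
    All parameter defaults here are illustrative assumptions."""
    bins = {}
    for ts_us, nbytes in packets:
        start = int(ts_us // window_us) * window_us
        bins[start] = bins.get(start, 0) + nbytes
    # Bytes a full-rate link could carry in one window.
    capacity_bytes = link_gbps * 1e9 / 8 * (window_us / 1e6)
    return sorted(
        (start, total / capacity_bytes)
        for start, total in bins.items()
        if total / capacity_bytes > threshold
    )

# Hypothetical usage: light background traffic plus one 100-microsecond
# burst of jumbo frames starting at t=500us.
pkts = [(t, 1500) for t in range(0, 1000, 100)]      # background packets
pkts += [(500 + i, 9000) for i in range(100)]        # ~900 KB burst
bursts = find_microbursts(pkts, window_us=100, link_gbps=100.0)
```

A one-second counter average of the same trace would show the link nearly idle, which is exactly why packet-level capture catches bursts that interval-based switch telemetry smooths away.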
Ultimately, cPacket emphasizes that observability for AI is essential for enterprises making significant investments in GPU workloads at the edge. The rapid evolution of AI necessitates a comprehensive approach that integrates packet insights, session metrics, and AI-driven analytics into existing SRE and NetOps workflows. This allows for proactive identification of anomalies, establishment of performance baselines, and continuous optimization of network topologies to ensure maximum AI throughput and resource efficiency, directly impacting the often high costs associated with AI downtime. The overarching message is to start with the business problem, understanding the specific challenges and desired outcomes for AI workloads, and then leverage cPacket's integrated, open, and AI-infused platform to drive measurable improvements.
Personnel: Erik Rudin, Ron Nevo