This video is part of the appearance “Hedgehog Presents at Networking Field Day 38”. It was recorded as part of Networking Field Day 38 at 13:30-14:30 on July 9, 2025.
Hedgehog CTO Manish Vachharajani reviewed how Hedgehog simplifies AI networking with a Virtual Private Cloud (VPC) abstraction used by customers like Zipline. He emphasized how complex modern GPU training networks are to design, given the many ports per server and the intricate configurations involved. Hedgehog addresses this with two main abstractions: a low-level wiring diagram that defines the physical topology (such as leaf/spine connections and AI-specific settings for RDMA traffic), and a VPC operational abstraction that partitions clusters into multi-tenant environments. Both are configured through the Kubernetes API, a well-known interface with a rich ecosystem of tools, including role-based access control, which Hedgehog extends to manage the physical network. Once the wiring diagram is fed into the Kubernetes API, Hedgehog automates the provisioning, booting, and configuration of network operating systems and agents on the switches, ensuring that the specified network policies are enforced.
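To make the idea of a declarative wiring diagram concrete, here is a minimal Python sketch. It is not Hedgehog's actual schema or tooling: the `Switch` and `Link` types, the switch names, and the `missing_fabric_links` helper are all hypothetical. The check simply reports the leaf/spine pairs that a diagram fails to connect, which is the kind of validation a declarative topology makes possible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Switch:
    name: str
    role: str  # "spine" or "leaf" (hypothetical role labels)


@dataclass(frozen=True)
class Link:
    a: str  # name of the switch on one end
    b: str  # name of the switch on the other end


def missing_fabric_links(switches, links):
    """Return (leaf, spine) pairs that have no link in the diagram."""
    spines = sorted(s.name for s in switches if s.role == "spine")
    leaves = sorted(s.name for s in switches if s.role == "leaf")
    # Links are undirected, so store each one as an unordered pair.
    connected = {frozenset((link.a, link.b)) for link in links}
    return [
        (leaf, spine)
        for leaf in leaves
        for spine in spines
        if frozenset((leaf, spine)) not in connected
    ]


# Hypothetical one-spine, two-leaf fabric with one uplink left out.
switches = [
    Switch("spine-1", "spine"),
    Switch("leaf-1", "leaf"),
    Switch("leaf-2", "leaf"),
]
links = [Link("leaf-1", "spine-1")]

print(missing_fabric_links(switches, links))  # [('leaf-2', 'spine-1')]
```

Adding `Link("leaf-2", "spine-1")` to `links` makes the function return an empty list.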
The core of Hedgehog’s multitenancy solution is its VPC abstraction, which creates isolated network environments with configurable DHCP, IP ranges, and host routes, and supports both L2 and L3 modes. The abstraction automates BGP EVPN, VLAN, and route-leak configuration, which is typically manual and error-prone. To let these isolated VPCs communicate, Hedgehog introduces VPC peering, a simple Kubernetes object that automatically configures the necessary route leaks so that specified subnets can communicate securely. Peering removes the need for hand-written route maps and ACLs, which significantly simplifies inter-VPC connectivity and reduces the risk of misconfiguration.
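The relationship between a peering object and the route leaks it implies can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Hedgehog's API: the dictionary layout, the `permit` key, the VPC and subnet names, and the prefixes are all hypothetical. The point is that only the subnets named in the peering are leaked to the other side.

```python
import ipaddress

# Hypothetical VPC definitions: each VPC owns named subnets.
vpcs = {
    "vpc-1": {"subnets": {"default": "10.0.1.0/24"}},
    "vpc-2": {"subnets": {"default": "10.0.2.0/24", "storage": "10.0.3.0/24"}},
}

# Hypothetical peering object: which subnets each side exposes to the other.
peering = {"permit": {"vpc-1": ["default"], "vpc-2": ["default"]}}


def route_leaks(vpcs, peering):
    """Return sorted (importing_vpc, exporting_vpc, prefix) tuples."""
    permit = peering["permit"]
    leaks = []
    for importer in permit:
        for exporter, subnet_names in permit.items():
            if exporter == importer:
                continue  # a VPC does not leak routes to itself
            for name in subnet_names:
                # ip_network validates the prefix before it is used.
                prefix = ipaddress.ip_network(vpcs[exporter]["subnets"][name])
                leaks.append((importer, exporter, str(prefix)))
    return sorted(leaks)


for leak in route_leaks(vpcs, peering):
    print(leak)
```

In this sketch the `storage` subnet of `vpc-2` is never leaked, because the peering does not list it.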
Sergei Lukianov, Hedgehog’s Chief Architect, demonstrated the provisioning of tenant VPCs and VPC peering on a three-switch topology (one spine, two leaves). The demo showed that, without peering, direct communication between servers in different VPCs (e.g., Server 1 in VPC1 and Server 4 in VPC2) fails. Once a simple peering YAML file is applied to the Kubernetes API, the network reconfigures automatically and the same communication succeeds. Behind the scenes, the Hedgehog fabric controller translates the peering object into switch configurations, including route leaking between VRFs (Virtual Routing and Forwarding instances). The demonstration also showcased Grafana Cloud integration, which collects and exports detailed network metrics (counters, queues, logs) from the switches and the control node, providing turnkey observability without extensive manual configuration. Manish further explained the limitations of purely switch-based peering for external connectivity, setting the stage for the upcoming discussion on gateway services.
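The before-and-after behavior in the demo can be modeled with a toy simulation of two VRFs. This is a sketch under assumptions, not the fabric controller's logic: the `Vrf` class, the `peer` and `can_ping` helpers, and the server addresses are hypothetical. It treats a ping as successful only when each VRF holds a route to the other server, which is what leaking routes in both directions provides.

```python
import ipaddress


class Vrf:
    """Toy VRF: a set of local prefixes plus prefixes imported via leaks."""

    def __init__(self, name, local_prefixes):
        self.name = name
        self.local = {ipaddress.ip_network(p) for p in local_prefixes}
        self.imported = set()

    def has_route(self, ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in self.local | self.imported)


def peer(a, b):
    """Leak each VRF's local prefixes into the other (both directions)."""
    a.imported |= b.local
    b.imported |= a.local


def can_ping(src_vrf, src_ip, dst_vrf, dst_ip):
    """A ping needs a forward route and a return route."""
    return src_vrf.has_route(dst_ip) and dst_vrf.has_route(src_ip)


# Hypothetical addressing for the two servers named in the demo.
vpc1 = Vrf("vpc-1", ["10.0.1.0/24"])
vpc2 = Vrf("vpc-2", ["10.0.2.0/24"])
server1, server4 = "10.0.1.10", "10.0.2.10"

before = can_ping(vpc1, server1, vpc2, server4)
peer(vpc1, vpc2)
after = can_ping(vpc1, server1, vpc2, server4)

print("before peering:", before)
print("after peering:", after)
```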
Personnel: Manish Vachharajani, Sergei Lukianov