Hedgehog CTO Manish Vachharajani reviewed how Hedgehog simplifies AI networking with a Virtual Private Cloud (VPC) abstraction used by customers like Zipline. He emphasized how complex modern GPU training networks are to design, with multiple ports per server and intricate configurations. Hedgehog addresses this with two main abstractions: a low-level wiring diagram that defines the physical topology (such as leaf/spine connections and AI-specific settings for RDMA traffic), and a VPC operational abstraction that partitions clusters into multi-tenant environments. Both are configured through the Kubernetes API, a well-known interface with a rich ecosystem of tooling, including role-based access control, which Hedgehog extends to manage the physical network. Once the wiring diagram is fed into the Kubernetes API, Hedgehog automates the provisioning, booting, and configuration of network operating systems and agents on the switches, and ensures that the specified network policies are enforced.
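To make the wiring-diagram idea concrete, here is a minimal Python sketch of a declared leaf/spine topology and a check that every leaf is linked to every spine. The `Switch` and `Link` classes, the switch names, and the `missing_fabric_links` helper are hypothetical and for illustration only; they are not Hedgehog's actual schema, which is expressed as Kubernetes objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Switch:
    name: str
    role: str  # "spine" or "leaf" (hypothetical role labels)


@dataclass(frozen=True)
class Link:
    a: str  # switch name at one end of the cable
    b: str  # switch name at the other end


def missing_fabric_links(switches, links):
    """Return (leaf, spine) pairs that have no declared link between them."""
    spines = [s.name for s in switches if s.role == "spine"]
    leaves = [s.name for s in switches if s.role == "leaf"]
    # Links are undirected, so compare them as unordered pairs.
    connected = {frozenset((link.a, link.b)) for link in links}
    return [
        (leaf, spine)
        for leaf in leaves
        for spine in spines
        if frozenset((leaf, spine)) not in connected
    ]


# One spine and two leaves, with one leaf uplink left out of the diagram.
switches = [
    Switch("spine-1", "spine"),
    Switch("leaf-1", "leaf"),
    Switch("leaf-2", "leaf"),
]
links = [Link("leaf-1", "spine-1")]

print(missing_fabric_links(switches, links))
```

Declaring the topology as data is what lets a controller validate it before any switch is touched; in this sketch the missing `leaf-2` uplink is reported rather than discovered later as a connectivity failure.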
The core of Hedgehog’s multitenancy solution is its VPC abstraction, which creates isolated network environments with configurable DHCP, IP ranges, and host routes, in both L2 and L3 modes. The abstraction automates the BGP EVPN, VLAN, and route-leak configuration that is typically done by hand and is error-prone. To let isolated VPCs communicate, Hedgehog introduces VPC peering: a simple Kubernetes object from which Hedgehog automatically configures the necessary route leaks, so that the specified subnets can communicate securely. This eliminates the need for manual route maps and ACLs, which significantly simplifies inter-VPC connectivity and reduces the risk of misconfiguration.
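The isolation and peering semantics described above can be sketched as a small reachability model in Python. This is an illustration under stated assumptions, not Hedgehog's implementation: the `VPC` and `Peering` classes, the subnet names, and the IP ranges are hypothetical, and the model assumes that subnets inside one VPC can always reach each other.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class VPC:
    name: str
    subnets: dict  # subnet name -> CIDR string (hypothetical layout)


@dataclass
class Peering:
    permit: dict  # VPC name -> list of subnet names exposed to the peer


def find_subnet(vpcs, ip):
    """Return (vpc name, subnet name) for the subnet containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for vpc in vpcs:
        for subnet_name, cidr in vpc.subnets.items():
            if addr in ipaddress.ip_network(cidr):
                return vpc.name, subnet_name
    return None


def can_reach(vpcs, peerings, src_ip, dst_ip):
    """True if src may talk to dst under the isolation and peering rules."""
    src = find_subnet(vpcs, src_ip)
    dst = find_subnet(vpcs, dst_ip)
    if src is None or dst is None:
        return False
    if src[0] == dst[0]:
        return True  # assumption: traffic inside one VPC is always allowed
    # Across VPCs, both subnets must be permitted by the same peering.
    return any(
        src[1] in p.permit.get(src[0], ()) and dst[1] in p.permit.get(dst[0], ())
        for p in peerings
    )


vpcs = [
    VPC("vpc-1", {"subnet-01": "10.0.1.0/24"}),
    VPC("vpc-2", {"subnet-01": "10.0.2.0/24"}),
]
peering = Peering({"vpc-1": ["subnet-01"], "vpc-2": ["subnet-01"]})

print(can_reach(vpcs, [], "10.0.1.10", "10.0.2.10"))
print(can_reach(vpcs, [peering], "10.0.1.10", "10.0.2.10"))
```

The point of the model is that the default is deny: two VPCs can exchange traffic only when a peering names subnets on both sides, which is the role route leaks play on the real switches.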
Sergei Lukianov, Hedgehog’s Chief Architect, demonstrated the provisioning of tenant VPCs and VPC peering on a three-switch topology (one spine, two leaves). The demo showed that without peering, direct communication between servers in different VPCs (e.g., Server 1 in VPC1 and Server 4 in VPC2) fails. After a simple peering YAML file is applied to the Kubernetes API, the network automatically reconfigures and the same communication succeeds. Behind the scenes, the Hedgehog fabric controller translates the peering object into switch configurations, including route leaking between VRFs (Virtual Routing and Forwarding instances). The demonstration also showcased Grafana Cloud integration, which collects and exports detailed network metrics (counters, queues, logs) from the switches and the control node, providing turnkey observability without extensive manual configuration. Vachharajani then explained the limitations of purely switch-based peering for external connectivity, setting the stage for the upcoming discussion on gateway services.
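The translation step the controller performs can be pictured with a short Python sketch that turns one peering object into route-leak entries, one per permitted subnet and direction. Everything here is an assumption made for illustration: the dictionary layout, the `vrf-` naming, the prefixes, and the `route_leaks` function are hypothetical and do not reflect the real controller or its generated switch configuration.

```python
def route_leaks(peering, vpc_subnets):
    """Translate a two-sided peering into route-leak entries between VRFs.

    peering:     VPC name -> list of subnet names that VPC exposes
    vpc_subnets: VPC name -> {subnet name -> CIDR prefix}
    """
    if len(peering) != 2:
        raise ValueError("a peering connects exactly two VPCs")
    (vpc_a, subs_a), (vpc_b, subs_b) = peering.items()
    leaks = []
    # Each side exports its permitted prefixes into the other side's VRF.
    for exporter, importer, names in (
        (vpc_a, vpc_b, subs_a),
        (vpc_b, vpc_a, subs_b),
    ):
        for name in names:
            leaks.append(
                {
                    "import_into": f"vrf-{importer}",
                    "from": f"vrf-{exporter}",
                    "prefix": vpc_subnets[exporter][name],
                }
            )
    return leaks


# Hypothetical inputs mirroring the demo's two VPCs.
vpc_subnets = {
    "vpc-1": {"subnet-01": "10.0.1.0/24"},
    "vpc-2": {"subnet-01": "10.0.2.0/24"},
}
peering = {"vpc-1": ["subnet-01"], "vpc-2": ["subnet-01"]}

for leak in route_leaks(peering, vpc_subnets):
    print(leak)
```

A single declarative object expands into symmetric entries on both VRFs, which is the bookkeeping an operator would otherwise do by hand with route maps.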
Personnel: Manish Vachharajani, Sergei Lukianov