Florian Berchtold, Marc Austin, Sergei Lukianov and Manish Vachharajani presented for Hedgehog at Networking Field Day 38
Presentation date: July 9, 2025, 13:30-14:30.
Presenters: Florian Berchtold, Manish Vachharajani, Marc Austin, Sergei Lukianov
Follow on Twitter using the following hashtags or usernames: #NFD38
How Zipline Uses Hedgehog for AI Training
Watch on YouTube
Watch on Vimeo
Zipline is a drone delivery company that trains AI on private cloud infrastructure to autonomously fly drones and drop packages in precise delivery locations. Florian Berchtold, Zipline’s Principal Engineer responsible for AI developer productivity, highlighted Hedgehog’s crucial role in their operations. Zipline chose an on-premises strategy for their AI infrastructure due to significant cost efficiencies and enhanced governance compared to public cloud options. Florian, a software engineer rather than a network engineer, sought a high-bandwidth networking solution that didn’t demand extensive network CLI expertise. Hedgehog provided a Kubernetes-native, declarative API, allowing Zipline to describe their infrastructure’s desired state in a familiar language, abstracting away complex networking configurations like port channels.
Previously, with a smaller server footprint, Zipline used Hedgehog for collapsed core designs, achieving high availability and high bandwidth on a modest scale without requiring specialized networking knowledge. Now, with over sixty servers across multiple racks, Hedgehog remains their preferred solution, supporting the larger spine-leaf topology required for their expanded infrastructure. However, a gap remained: while Hedgehog solved the internal fabric networking, Zipline still needed to connect their private cloud to the public internet, which required a separate firewall/router. As an interim solution, Zipline deployed expensive legacy firewalls that provided far more capability than needed for the limited bandwidth they used, leading to significant unnecessary costs.
Florian anticipates that Hedgehog’s new Transit Gateway, shown in the gateway demonstration, will fill this gap. He expects the gateway to provide essential routing capabilities, allowing hosts with private fabric IPs to access the public internet, along with Network Address Translation (NAT) and basic port forwarding to expose services hosted on premises. This new functionality aims to replace their costly existing firewalls with a more integrated and cost-effective solution that aligns with their cloud-native infrastructure and declarative management approach.
Personnel: Florian Berchtold, Marc Austin
Hedgehog VPC Peering Demonstration
Hedgehog CTO Manish Vachharajani reviewed how Hedgehog simplifies AI networking with a Virtual Private Cloud (VPC) abstraction used by customers like Zipline, emphasizing the complexities of designing modern GPU training networks with multiple ports and intricate configurations. Hedgehog addresses this by providing two main abstractions: a low-level wiring diagram for defining physical topology (like leaf/spine connections and AI-specific settings for RDMA traffic), and a VPC operational abstraction for partitioning clusters into multi-tenant environments. This approach leverages the Kubernetes API for configuration, offering a well-known interface with a rich ecosystem of tools for role-based access control and extending its capabilities to manage the physical network. Once the wiring diagram is fed into the Kubernetes API, Hedgehog automates the provisioning, booting, and configuration of network operating systems and agents on the switches, ensuring the specified network policies are enforced.
The core of Hedgehog’s multitenancy solution lies in its VPC abstraction, enabling the creation of isolated network environments with configurable DHCP, IP ranges, and host routes, supporting both L2 and L3 modes. This abstraction automates the complexities of BGP EVPN, VLANs, and route leaks, which are typically manual and error-prone configurations. To facilitate communication between these isolated VPCs, Hedgehog introduces VPC peering, a simple Kubernetes object that automatically configures the necessary route leaks, allowing specified subnets to communicate securely. This eliminates the need for manual route maps and ACLs, significantly simplifying inter-VPC connectivity and reducing the risk of misconfigurations.
Sergei Lukianov, Hedgehog’s Chief Architect, demonstrated the provisioning of tenant VPCs and VPC peering on a three-switch topology (one spine, two leaves). The demo showed that without peering, direct communication between servers in different VPCs (e.g., Server 1 in VPC1 and Server 4 in VPC2) fails. However, by applying a simple peering YAML file to the Kubernetes API, the network automatically reconfigures, enabling successful communication. This process involves the Hedgehog fabric controller translating the peering object into switch configurations, including route leaking between VRFs (Virtual Routing and Forwarding instances). The demonstration also showcased Grafana Cloud integration for collecting and exporting detailed network metrics (counters, queues, logs) from switches and the control node, providing turnkey observability without extensive manual configuration. Manish further explained the limitations of purely switch-based peering for external connectivity, setting the stage for the upcoming discussion on gateway services.
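The before-and-after behavior in the peering demo can be illustrated with a toy model of route leaking between VRFs. This is only a sketch: the `Vrf` class, the `peer` function, and the subnet addresses are hypothetical and are not Hedgehog's API or the addresses used in the demo. It shows the idea that a server in one VPC has no route to the other VPC until peering imports the peer's subnets.

```python
from dataclasses import dataclass, field
from ipaddress import IPv4Network, ip_address, ip_network


@dataclass
class Vrf:
    """Hypothetical stand-in for the per-VPC routing table on a switch."""
    name: str
    subnets: list[IPv4Network]                      # subnets owned by this VPC
    imported: set[IPv4Network] = field(default_factory=set)  # leaked-in routes

    def has_route_to(self, addr: str) -> bool:
        # A destination is reachable only if a local or imported prefix covers it.
        target = ip_address(addr)
        return any(target in net for net in [*self.subnets, *self.imported])


def peer(a: Vrf, b: Vrf) -> None:
    """Leak each VPC's subnets into the other VRF (both directions)."""
    a.imported.update(b.subnets)
    b.imported.update(a.subnets)


# Assumed addressing, chosen only for illustration.
vpc1 = Vrf("vpc-1", [ip_network("10.0.1.0/24")])
vpc2 = Vrf("vpc-2", [ip_network("10.0.2.0/24")])
server4 = "10.0.2.4"  # a host in vpc-2

reachable_before = vpc1.has_route_to(server4)
peer(vpc1, vpc2)
reachable_after = vpc1.has_route_to(server4)
print(reachable_before, reachable_after)
```

In the real fabric the controller derives the equivalent switch configuration from the peering object; the model above only captures the resulting reachability.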
Personnel: Manish Vachharajani, Sergei Lukianov
Hedgehog Gateway Demonstration
Hedgehog CTO Manish Vachharajani explained how Hedgehog gateway peering functions as a new component to overcome limitations of switch-based VPC peering. While switch-based peering offers full cut-through bandwidth, traditional switches lack the CPU and RAM for stateful network functions like firewalling, NAT, and handling large routing tables or TCP termination. The Hedgehog Gateway addresses this by leveraging a CPU-rich, high-bandwidth server positioned in the traffic flow between VPCs. This commodity hardware, combined with modern NICs featuring hardware offloads for NAT and VXLAN, can achieve significant throughput (initially targeting 40 Gbps, with plans for 100 Gbps and higher). The gateway operates by acting as a VTEP and selectively advertising routes to attract specific traffic, performing necessary network transformations (including implied NAT as demonstrated), and then re-encapsulating and transmitting packets to their destination VPC.
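The idea of attracting traffic by selectively advertising routes can be sketched with a plain longest-prefix-match lookup. The route table, the next-hop labels, and the prefixes below are assumptions made for illustration and do not reflect Hedgehog's implementation: only the prefix the gateway advertises into a VPC's routing table becomes reachable, and its next hop is the gateway.

```python
from ipaddress import ip_address, ip_network


def lookup(routes, dst):
    """Return the next hop of the most specific matching route, or None."""
    addr = ip_address(dst)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # Longest prefix wins, as in ordinary IP forwarding.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Hypothetical routing table for VPC 1: initially only its own subnet.
vpc1_routes = [(ip_network("10.0.1.0/24"), "local")]

# The gateway advertises only the part of VPC 2 that the peering exposes,
# with itself as the next hop (label is illustrative).
vpc1_routes.append((ip_network("10.0.2.0/25"), "gateway-vtep"))

to_exposed = lookup(vpc1_routes, "10.0.2.10")   # inside the advertised prefix
to_hidden = lookup(vpc1_routes, "10.0.2.200")   # not advertised, so no route
to_local = lookup(vpc1_routes, "10.0.1.5")      # stays inside VPC 1
print(to_exposed, to_hidden, to_local)
```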
Sergei Lukianov, Chief Architect, demonstrated gateway-based VPC peering with basic firewall functions that aim to replace Zipline’s existing Palo Alto firewalls. The demo illustrated how the gateway enables communication between VPCs with overlapping IP addresses by performing NAT. The gateway advertises the translated (NAT’d) IP prefixes into the VRFs of the peered VPCs, so that traffic to those prefixes is routed through the gateway. The demonstration highlighted the comprehensive visibility provided by Hedgehog’s data plane on the gateway, offering insights into traffic flow that traditional switches often lack. The gateway introduces a slight latency increase due to the additional hops (exaggerated in the demo, which used debug images), but it offers significantly more flexibility and functionality than switch-based peering.
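To make the overlapping-address case concrete, here is a minimal sketch of static one-to-one prefix translation, assuming both VPCs use the same real subnet and each is presented to the other under a distinct translated prefix. All prefixes and function names are hypothetical; the source does not specify the NAT mode or addresses used in the demo.

```python
from ipaddress import ip_address, ip_network


def remap(addr, from_net, to_net):
    """Translate addr from one prefix to another, preserving the host offset."""
    if from_net.prefixlen != to_net.prefixlen:
        raise ValueError("prefix lengths must match for 1:1 translation")
    ip = ip_address(addr)
    if ip not in from_net:
        raise ValueError(f"{addr} is not in {from_net}")
    offset = int(ip) - int(from_net.network_address)
    return str(to_net.network_address + offset)


# Assumed addressing: both VPCs really use 10.0.0.0/24 (the overlap).
REAL = ip_network("10.0.0.0/24")
VPC1_AS_SEEN = ip_network("192.168.1.0/24")  # how VPC 2 sees VPC 1
VPC2_AS_SEEN = ip_network("192.168.2.0/24")  # how VPC 1 sees VPC 2


def gateway_forward(src, dst):
    """Rewrite a packet travelling from VPC 1 to VPC 2.

    The source becomes VPC 1's translated address so replies can return,
    and the destination becomes the real address inside VPC 2.
    """
    return remap(src, REAL, VPC1_AS_SEEN), remap(dst, VPC2_AS_SEEN, REAL)


packet_out = gateway_forward("10.0.0.5", "192.168.2.9")
print(packet_out)
```

Because each side only ever addresses the other through its translated prefix, the two identical real subnets never appear in the same routing table.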
Looking ahead, Hedgehog plans to enhance the gateway’s capabilities by moving the software onto DPUs (Data Processing Units) within the host, such as NVIDIA BlueField, for improved performance and scalability. This approach would significantly reduce latency and allow for deeper network extension into virtual environments like VMs and containers. The gateway also includes basic security functions such as ACLs and port forwarding, with a roadmap to add more advanced features like DDoS protection, IDS/IPS, and Layer 7 inspection, driven by customer demand or open-source contributions. Furthermore, Hedgehog aims to support multi-data center deployments through Kubernetes Federation, allowing independent clusters to connect via gateway tunnels while presenting a unified API to the end user.
Personnel: Manish Vachharajani, Sergei Lukianov