To maximize throughput and minimize packet loss, Ethernet fabrics carrying AI/ML traffic rely on the DCQCN congestion management protocol, but tuning DCQCN adds significant operational complexity for human operators. Learn how Juniper Apstra takes this challenge in stride, automatically tuning the fabric for high throughput and the “right amount” of packet loss.
Juniper Networks’ presentation at Cloud Field Day 20, led by Rajagopalan Subrahmanian and Vikram Singh, focused on automated congestion management in AI/ML data center fabrics. They began by explaining the challenges network administrators face in managing congestion, drawing an analogy to the metering lights that regulate traffic flow onto freeways. In AI/ML environments the problem is harder: many more entities need monitoring, and tuning congestion parameters by hand is slow and error-prone. Juniper’s solution builds on its Apstra platform to automate this process, using continuous monitoring and closed-loop automation to tune network performance dynamically.
The core of Juniper’s approach is a DCQCN AutoTune application that uses Apstra’s capabilities to monitor key performance indicators and adjust network configurations in real time. By simulating high-traffic scenarios in their lab, the presenters demonstrated how the system detects congestion and uses Terraform to adjust configurations across the network fabric. This automated process helps maintain optimal throughput and the right amount of packet loss, setting parameters from live measurements rather than static, manual settings. The system can apply changes selectively to the affected switches or more broadly across similar network segments to preempt potential issues.
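The presentation summary does not spell out the tuning logic, so the following is only a minimal sketch of what one closed-loop iteration could look like, under stated assumptions. The names `EcnThresholds`, `Kpis`, and `tune_step`, the step sizes, and the heuristic itself are hypothetical illustrations, not Juniper's AutoTune algorithm or an Apstra API. The sketch only computes new threshold values; in the demo described above, applying changes to the fabric is done through Terraform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EcnThresholds:
    """Hypothetical ECN marking thresholds for one queue, in bytes."""
    k_min: int
    k_max: int


@dataclass(frozen=True)
class Kpis:
    """Hypothetical counters observed over one monitoring interval."""
    pfc_pause_frames: int
    ecn_marked_packets: int
    tx_packets: int


def tune_step(current: EcnThresholds, kpis: Kpis,
              step_bytes: int = 50_000,
              floor_bytes: int = 50_000,
              max_mark_ratio: float = 0.05) -> EcnThresholds:
    """Return the thresholds to use for the next interval (assumed heuristic)."""
    if kpis.pfc_pause_frames > 0:
        # PFC fired, so ECN reacted too late: lower the thresholds so that
        # marking starts earlier and senders slow down before buffers fill.
        new_min = max(floor_bytes, current.k_min - step_bytes)
        new_max = max(new_min + step_bytes, current.k_max - step_bytes)
        return EcnThresholds(new_min, new_max)

    mark_ratio = (kpis.ecn_marked_packets / kpis.tx_packets
                  if kpis.tx_packets else 0.0)
    if mark_ratio > max_mark_ratio:
        # Heavy marking without any pauses may be throttling senders more
        # than necessary: raise the thresholds to recover throughput.
        return EcnThresholds(current.k_min + step_bytes,
                             current.k_max + step_bytes)

    # KPIs are within the assumed targets: leave the configuration alone.
    return current
```

A real controller would also need damping and limits per switch model; the point here is only the shape of the loop: observe counters, compare against targets, and emit a new configuration.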
Juniper’s method combines two Ethernet congestion control mechanisms: Priority Flow Control (PFC) and Explicit Congestion Notification (ECN). PFC is the blunt instrument: it pauses traffic in a given priority class when buffers are nearly full. ECN is more granular: it marks packets to signal congestion, prompting senders to reduce their transmission rates. DCQCN uses both techniques together to manage congestion. Juniper’s automation adjusts these settings dynamically so that the network remains stable and efficient under varying loads. The presentation also highlighted the flexibility of the approach and its potential for further customization, including integration with application-level metrics and additional congestion indicators from SmartNICs.
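To make the difference between the two mechanisms concrete, here is a small sketch of how a switch queue might decide between marking and pausing. It assumes the threshold-based marking profile commonly used with DCQCN (no marking below a minimum queue depth, a linear ramp up to a maximum probability, and marking every packet beyond a maximum depth) and a single PFC XOFF threshold. The names `EcnProfile`, `ecn_mark_probability`, and `pfc_should_pause`, and all numeric values, are hypothetical and not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EcnProfile:
    """Hypothetical ECN marking profile for one queue."""
    k_min: int    # queue depth (bytes) at or below which nothing is marked
    k_max: int    # queue depth (bytes) at or above which everything is marked
    p_max: float  # marking probability approached just below k_max


def ecn_mark_probability(queue_bytes: int, profile: EcnProfile) -> float:
    """Probability that an arriving packet is ECN-marked at this queue depth."""
    if queue_bytes <= profile.k_min:
        return 0.0
    if queue_bytes >= profile.k_max:
        return 1.0
    # Linear ramp from 0 at k_min up to p_max at k_max.
    span = profile.k_max - profile.k_min
    return profile.p_max * (queue_bytes - profile.k_min) / span


def pfc_should_pause(queue_bytes: int, xoff_bytes: int) -> bool:
    """PFC is all-or-nothing: pause the priority once the XOFF depth is hit."""
    return queue_bytes >= xoff_bytes


if __name__ == "__main__":
    profile = EcnProfile(k_min=100_000, k_max=400_000, p_max=0.2)
    xoff = 800_000  # assumed to sit well above k_max so ECN acts first
    for depth in (50_000, 250_000, 500_000, 900_000):
        p = ecn_mark_probability(depth, profile)
        print(depth, round(p, 3), pfc_should_pause(depth, xoff))
```

Placing the XOFF threshold above `k_max` reflects the intent described above: ECN should slow senders gradually, with PFC as the last resort when buffers are nearly full.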
Personnel: Rajagopalan Subrahmanian, Vikram Singh