This video is part of the appearance “Juniper Networks Presents at Cloud Field Day 20”. It was recorded as part of Cloud Field Day 20 at 8:00-11:30 on June 12, 2024.
To maximize throughput and minimize packet loss, Ethernet uses the DCQCN congestion management protocol, but DCQCN introduces significant operational complexity for human operators. Learn how Juniper Apstra takes this new challenge in stride, automatically tuning the fabric for maximum throughput and the “right amount” of packet loss.
Juniper Networks’ presentation at Cloud Field Day 20, led by Rajagopalan Subrahmanian and Vikram Singh, focused on automated congestion management in AI/ML data center fabrics. They began by explaining the challenges network administrators face in managing congestion, drawing an analogy to the metering lights that regulate traffic flow onto freeways. In AI/ML environments, the complexity increases because of the large number of entities that need monitoring and the manual, error-prone process of tuning congestion parameters. Juniper’s solution integrates with its Apstra platform to automate this process, using continuous monitoring and closed-loop automation to optimize network performance dynamically.
The core of Juniper’s approach is a DCQCN AutoTune application that uses Apstra’s capabilities to monitor key performance indicators and adjust network configurations in real time. By simulating high-traffic scenarios in their lab, the presenters demonstrated how the system detects congestion and uses Terraform to adjust configurations across the network fabric. This automated process helps maintain optimal throughput and the right amount of packet loss, setting parameters from real-time data rather than static, manual settings. The system can apply changes selectively to the affected switches or more broadly across similar network segments to preempt potential issues.
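To make the closed-loop idea concrete, here is a minimal sketch of a single tuning step. It is not Juniper's algorithm: the `EcnProfile` type, the `tune_step` function, the counter names, and the thresholds are all hypothetical, and the heuristic (mark earlier when PFC fires, mark later when ECN marking is heavy without PFC) is only one plausible rule. In the demonstrated system, the resulting values would be pushed to the fabric through Terraform rather than returned from a function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EcnProfile:
    """Hypothetical ECN thresholds for one switch queue, in kilobytes."""
    k_min_kb: int
    k_max_kb: int


def tune_step(profile, pfc_pause_frames, ecn_mark_ratio,
              step_pct=10, high_mark_ratio=0.5, floor_kb=10):
    """Return an adjusted profile from one interval of telemetry.

    Assumed heuristic (illustrative only):
    - PFC pause frames seen: ECN reacted too late, so lower the thresholds.
    - No PFC but heavy ECN marking: senders may be throttled too hard,
      so raise the thresholds.
    - Otherwise: leave the profile unchanged.
    """
    if pfc_pause_frames > 0:
        scale_pct = 100 - step_pct
    elif ecn_mark_ratio > high_mark_ratio:
        scale_pct = 100 + step_pct
    else:
        return profile

    k_min = max(floor_kb, profile.k_min_kb * scale_pct // 100)
    # Keep k_max strictly above k_min so the marking ramp stays valid.
    k_max = max(k_min + 1, profile.k_max_kb * scale_pct // 100)
    return EcnProfile(k_min, k_max)


if __name__ == "__main__":
    current = EcnProfile(k_min_kb=100, k_max_kb=1000)
    print(tune_step(current, pfc_pause_frames=5, ecn_mark_ratio=0.0))
```

Running such a step on every monitoring interval, and applying the result only when it differs from the current profile, is what turns static settings into a feedback loop.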
Juniper’s method combines two Ethernet congestion control mechanisms: Priority Flow Control (PFC) and Explicit Congestion Notification (ECN). PFC acts as a brute-force method to stop traffic when buffers are nearly full, while ECN offers a more granular approach by marking packets to signal congestion and prompt sender devices to reduce their transmission rates. The DCQCN protocol judiciously uses both techniques to manage congestion effectively. Juniper’s automation adjusts these settings dynamically, ensuring that the network remains stable and efficient under varying loads. The presentation highlighted the flexibility and potential for further customization, including integration with application-level metrics and additional congestion indicators from SmartNICs.
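The contrast between the two mechanisms can be sketched in a few lines. This example assumes the RED-style marking curve commonly used for DCQCN on switches: no marking below a minimum queue threshold, a linear ramp up to a maximum probability, and marking of every packet above the maximum threshold. The function names, parameter names, and numeric values are illustrative assumptions, not settings from the presentation.

```python
def ecn_mark_probability(queue_kb, k_min_kb, k_max_kb, p_max):
    """Probability that a packet is ECN-marked at a given queue depth.

    Assumed RED-style curve: 0 up to k_min, linear ramp to p_max at k_max,
    and 1.0 (mark everything) once the queue exceeds k_max.
    """
    if queue_kb <= k_min_kb:
        return 0.0
    if queue_kb > k_max_kb:
        return 1.0
    return p_max * (queue_kb - k_min_kb) / (k_max_kb - k_min_kb)


def pfc_should_pause(queue_kb, xoff_kb):
    """PFC is binary: pause the priority once the queue reaches XOFF."""
    return queue_kb >= xoff_kb


if __name__ == "__main__":
    # Illustrative thresholds; XOFF sits above k_max so ECN acts first.
    for depth in (100, 825, 1600, 2000):
        p = ecn_mark_probability(depth, k_min_kb=150, k_max_kb=1500, p_max=0.1)
        paused = pfc_should_pause(depth, xoff_kb=1800)
        print(f"queue={depth} KB  ecn_p={p:.3f}  pfc_pause={paused}")
```

Placing the PFC threshold above the ECN range gives senders a chance to slow down gradually before the pause takes effect, which is the balance the tuning is meant to preserve.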
Personnel: Rajagopalan Subrahmanian, Vikram Singh