Juniper advances AI networking software with congestion control, load balancing

Apstra works by keeping a real-time repository of configuration, telemetry and validation information to ensure a network is doing what the organization wants it to do. Companies can use Apstra’s automation capabilities to deliver consistent network and security policies for workloads across physical and virtual infrastructures. In addition, Apstra performs regular network checks to safeguard configurations. It’s hardware agnostic, so it works with Juniper’s networking products as well as boxes from Cisco, Arista, Dell, Microsoft and Nvidia.

Load balancing and visibility improvements

On the load balancing front, Juniper has added support for dynamic load balancing (DLB), which selects the optimal network path to deliver lower latency, better network utilization, and faster job completion times. For AI workloads, this translates into better performance and higher utilization of expensive GPUs, according to Sanyal.

“Compared to traditional static load balancing, DLB significantly enhances fabric bandwidth utilization. But one of DLB’s limitations is that it only tracks the quality of local links instead of understanding the whole path quality from ingress to egress node,” Sanyal wrote. “Let’s say we have CLOS topology and server 1 and server 2 are both trying to send data called flow-1 and flow-2, respectively. In the case of DLB, leaf-1 only knows the local links utilization and makes decisions based solely on the local switch quality table where local links may be in perfect state. But if you use GLB [global load balancing], you can understand the whole path quality where congestion issues are present within the spine-leaf level.”
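
The difference Sanyal describes can be illustrated with a toy Python sketch. It is not Juniper's implementation: the link names, utilization figures and both selection rules are hypothetical simplifications, chosen only to show how a decision based on local links can differ from one based on the whole path.

```python
# Hypothetical utilization figures (0.0 = idle, 1.0 = saturated); not Juniper data.
# A path from leaf-1 to the destination leaf has two hops:
# leaf-1 -> spine (local link) and spine -> destination leaf (remote link).
local_util = {"spine-1": 0.20, "spine-2": 0.30}   # what leaf-1 can see itself
remote_util = {"spine-1": 0.95, "spine-2": 0.25}  # congestion beyond leaf-1

def pick_local(local):
    """Local-only choice (DLB-like): least-utilized local uplink."""
    return min(local, key=local.get)

def pick_global(local, remote):
    """Whole-path choice (GLB-like): minimize the busiest link on the path."""
    return min(local, key=lambda spine: max(local[spine], remote[spine]))

dlb_choice = pick_local(local_util)
glb_choice = pick_global(local_util, remote_util)
print("local-only choice:", dlb_choice)
print("whole-path choice:", glb_choice)
```

In this made-up case the local-only rule sends traffic through spine-1, whose uplink looks healthy from leaf-1, while the whole-path rule avoids it because the link behind that spine is nearly saturated.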

In terms of visibility, Sanyal pointed out limitations in existing network performance management technologies:

“Today, admins can find out where congestion occurs by observing only the network switches. But they don’t have any visibility into which endpoints (GPUs, in the case of AI data centers) are impacted by the congestion. This leads to challenges in identifying and resolving performance issues. In a multi-training job environment, just by looking at switch telemetry, it is impossible to find which training jobs have been slowed down due to congestion without manually checking the NIC RoCE v2 stats on all the servers, which is not practical,” Sanyal wrote.

Juniper is addressing the issue by integrating RoCE v2 streaming telemetry from the AI server SmartNICs with Juniper Apstra and correlating it with existing network switch telemetry. That integration and correlation “greatly enhances the observability and debugging workflows when performance issues occur,” Sanyal wrote. “This correlation allows for a more holistic network view and a better understanding of the relationships between AI servers and network behaviors. The real-time data provides insights into network performance, traffic patterns, potential congestion points, and impacted endpoints, helping identify performance bottlenecks and anomalies.”
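
As a rough illustration of what such a correlation involves, the Python sketch below joins switch-port congestion counters with per-NIC RoCE v2 counters to list which training jobs are affected. The record layout, field names, threshold and sample values are all assumptions made for this example; they do not reflect Apstra's data model or API.

```python
from collections import defaultdict

# Hypothetical telemetry samples; field names are illustrative only.
switch_ports = [
    {"switch": "leaf-1", "port": "et-0/0/1", "ecn_marked": 1200},
    {"switch": "leaf-1", "port": "et-0/0/2", "ecn_marked": 0},
    {"switch": "leaf-2", "port": "et-0/0/1", "ecn_marked": 0},
]
nic_stats = [
    {"server": "gpu-01", "job": "train-A", "switch": "leaf-1",
     "port": "et-0/0/1", "cnp_received": 950},
    {"server": "gpu-02", "job": "train-B", "switch": "leaf-1",
     "port": "et-0/0/2", "cnp_received": 0},
    {"server": "gpu-03", "job": "train-A", "switch": "leaf-2",
     "port": "et-0/0/1", "cnp_received": 0},
]

def impacted_jobs(switch_ports, nic_stats, ecn_threshold=100):
    """Map each job to the servers attached to a congested switch port
    whose NIC also reports congestion notification packets (CNPs)."""
    congested = {
        (p["switch"], p["port"])
        for p in switch_ports
        if p["ecn_marked"] > ecn_threshold
    }
    jobs = defaultdict(list)
    for nic in nic_stats:
        attached = (nic["switch"], nic["port"])
        if attached in congested and nic["cnp_received"] > 0:
            jobs[nic["job"]].append(nic["server"])
    return dict(jobs)

print(impacted_jobs(switch_ports, nic_stats))
```

With switch counters alone, an admin would see only that one port on leaf-1 is congested. Adding the NIC side ties that port to a specific server and job without logging in to each host.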


