Juniper advances AI networking software with congestion control, load balancing
Apstra works by keeping a real-time repository of configuration, telemetry and validation information to ensure a network is doing what the organization wants it to do. Companies can use Apstra’s automation capabilities to deliver consistent network and security policies for workloads across physical and virtual infrastructures. In addition, Apstra performs regular network checks to safeguard configurations. It’s hardware agnostic, so it can be integrated with Juniper’s networking products as well as boxes from Cisco, Arista, Dell, Microsoft and Nvidia.
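The intent-based validation idea described above can be illustrated with a toy sketch: compare the desired ("intended") state of each device against what live telemetry actually reports, and flag mismatches. The data model and field names here are hypothetical, not Apstra's actual schema.

```python
# Toy illustration of intent-based network validation -- not Apstra's code.
# Intended state is what the operator wants; observed state comes from telemetry.

intended = {"leaf-1": {"vlan": 100, "mtu": 9100}}
observed = {"leaf-1": {"vlan": 100, "mtu": 1500}}  # MTU drifted from intent

def validate(intended, observed):
    """Return (device, attribute, wanted, actual) tuples for every mismatch."""
    anomalies = []
    for device, want in intended.items():
        have = observed.get(device, {})
        for key, value in want.items():
            if have.get(key) != value:
                anomalies.append((device, key, value, have.get(key)))
    return anomalies

print(validate(intended, observed))  # [('leaf-1', 'mtu', 9100, 1500)]
```

A real system would run this check continuously against streamed telemetry rather than static dictionaries, which is what the "regular network checks" above amount to.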
Load balancing and visibility improvements
On the load balancing front, Juniper has added support for dynamic load balancing (DLB) that selects the optimal network path and delivers lower latency, better network utilization, and faster job completion times. From the AI workload perspective, this results in better AI workload performance and higher utilization of expensive GPUs, according to Sanyal.
“Compared to traditional static load balancing, DLB significantly enhances fabric bandwidth utilization. But one of DLB’s limitations is that it only tracks the quality of local links instead of understanding the whole path quality from ingress to egress node,” Sanyal wrote. “Let’s say we have CLOS topology and server 1 and server 2 are both trying to send data called flow-1 and flow-2, respectively. In the case of DLB, leaf-1 only knows the local links utilization and makes decisions based solely on the local switch quality table where local links may be in perfect state. But if you use GLB, you can understand the whole path quality where congestion issues are present within the spine-leaf level.”
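Sanyal's leaf-spine example can be sketched in a few lines: a DLB-style decision sees only the quality of the leaf's local uplinks (which may all look perfect), while a GLB-style decision scores the whole ingress-to-egress path and can route around congestion deeper in the fabric. The quality scores and spine names below are made up for illustration; this is not Juniper's algorithm.

```python
# Hypothetical DLB vs. GLB next-hop selection (higher score = better quality).

# What leaf-1 can see locally: both uplinks look perfect.
local_link_quality = {"spine-1": 0.9, "spine-2": 0.9}

# End-to-end path quality that only a global (GLB) view knows:
# the path via spine-1 is congested further downstream.
path_quality = {"spine-1": 0.3, "spine-2": 0.8}

def pick_next_hop_dlb(local_quality):
    """DLB: decide solely from the local switch's link-quality table."""
    return max(local_quality, key=local_quality.get)

def pick_next_hop_glb(path_qual):
    """GLB: decide from whole-path quality, ingress to egress."""
    return max(path_qual, key=path_qual.get)

print(pick_next_hop_dlb(local_link_quality))  # spine-1 (unaware of congestion)
print(pick_next_hop_glb(path_quality))        # spine-2 (avoids congested path)
```

The point of the example is the information gap, not the scoring function: with identical local scores, DLB has no basis to prefer one spine over the other, while GLB's path-level view steers the flow away from the congested leg.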
In terms of visibility, Sanyal pointed out limitations in existing network performance management technologies:
“Today, admins can find out where congestion occurs by observing only the network switches. But they don’t have any visibility into which endpoints (GPUs, in the case of AI data centers) are impacted by the congestion. This leads to challenges in identifying and resolving performance issues. In a multi-training job environment, just by looking at switch telemetry, it is impossible to find which training jobs have been slowed down due to congestion without manually checking the NIC RoCE v2 stats on all the servers, which is not practical,” Sanyal wrote.
Juniper is addressing the issue by integrating RoCE v2 streaming telemetry from the AI server SmartNICs into Juniper Apstra and correlating it with existing network switch telemetry; that integration and correlation “greatly enhances the observability and debugging workflows when performance issues occur,” Sanyal wrote. “This correlation allows for a more holistic network view and a better understanding of the relationships between AI servers and network behaviors. The real-time data provides insights into network performance, traffic patterns, potential congestion points, and impacted endpoints, helping identify performance bottlenecks and anomalies.”
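The correlation step described above is essentially a join: congested switch ports from fabric telemetry matched against an inventory of which GPU server (and which training job) sits behind each port. The record shapes, port names and job labels below are illustrative assumptions, not Apstra's API.

```python
# Hypothetical correlation of switch congestion telemetry with the
# GPU servers and training jobs attached to each port.

switch_telemetry = [
    {"switch": "leaf-1", "port": "et-0/0/1", "congested": True},
    {"switch": "leaf-2", "port": "et-0/0/5", "congested": False},
]

# Which server each (switch, port) connects to, and the job it runs --
# the kind of data RoCE v2 NIC telemetry plus inventory would supply.
nic_inventory = {
    ("leaf-1", "et-0/0/1"): {"server": "gpu-srv-7", "job": "train-llm-A"},
    ("leaf-2", "et-0/0/5"): {"server": "gpu-srv-9", "job": "train-llm-B"},
}

def impacted_jobs(telemetry, inventory):
    """Join congested ports against NIC attachments to name affected jobs."""
    jobs = set()
    for record in telemetry:
        if record["congested"]:
            endpoint = inventory.get((record["switch"], record["port"]))
            if endpoint:
                jobs.add(endpoint["job"])
    return jobs

print(impacted_jobs(switch_telemetry, nic_inventory))  # {'train-llm-A'}
```

With this join in place, an operator sees "job train-llm-A is slowed by congestion on leaf-1 et-0/0/1" directly, instead of manually checking RoCE v2 stats on every server as the quote describes.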