Industry groups drive Ethernet upgrades for AI, HPC

UEC version 1.0

Work on the UEC specifications is following what the group calls a very aggressive timeline, with version 1.0 slated to be released in the third quarter of 2024. The UEC 1.0 Overview explains some of the group’s priorities for the forthcoming specification.

“Even when considering the advantages of using Ethernet, improvements can and should be made,” the UEC stated. “Networks must evolve to better deliver this unprecedented performance for the increased scale and higher bandwidth of networks of the future. Paramount is the need to have the network support delivery of messages to all participating endpoints as quickly as possible, without long delays for even a few endpoints.”

For example, the UEC cites the need to minimize “tail latency” in the training of AI models: “Training consists of frequent computation and communications phases, where the initiation of the next phase of the training is dependent on the completion of the communication phase across the suite of GPUs. The last message to arrive gates the progress of all GPUs. This tail latency – measured by the arrival time of the last message in the communication phase – is a critical metric in system performance.”
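The gating effect the UEC describes can be illustrated with a small sketch. It is not taken from the UEC material; the function names and the 1% per-endpoint delay probability are hypothetical, chosen only to show why the slowest message, rather than the average one, sets the pace of a training step.

```python
def phase_completion_time(arrival_times_ms):
    """A communication phase ends only when the last message has arrived."""
    return max(arrival_times_ms)


def prob_phase_delayed(num_endpoints, per_endpoint_delay_prob):
    """Probability that at least one endpoint is delayed, assuming
    independent delays (an illustrative assumption, not a UEC figure)."""
    return 1.0 - (1.0 - per_endpoint_delay_prob) ** num_endpoints


if __name__ == "__main__":
    # Hypothetical phase: seven messages arrive in 1 ms, one takes 10 ms.
    arrivals = [1.0] * 7 + [10.0]
    mean_ms = sum(arrivals) / len(arrivals)
    print(f"mean arrival: {mean_ms:.3f} ms")
    print(f"phase completes at: {phase_completion_time(arrivals):.1f} ms")

    # The chance that some endpoint is late grows quickly with scale.
    for n in (8, 128, 1024):
        print(n, round(prob_phase_delayed(n, 0.01), 4))
```

Under these assumptions, a delay that affects any single endpoint only rarely still slows almost every phase once the cluster is large, which is why the group treats tail latency as a critical metric.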

To achieve low tail latency, the UEC specification will address critical networking requirements for the next generation of applications, including:

  • Multi-pathing and packet spraying
  • Flexible delivery order
  • Modern congestion control mechanisms
  • End-to-end telemetry
  • Larger scale, stability, and reliability

“This last point places an extra burden on all of the previous ones,” the UEC stated. “High-performance systems leave little margin for error, which compounds in a larger network. Determinism and predictability become more difficult as systems grow, necessitating new methods to achieve holistic stability.”

Another challenge the UEC is working to address for AI and high-performance networks is support for multiple communication paths between clusters.
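To make the multi-path idea concrete, the sketch below contrasts two generic ways of placing traffic on parallel links: pinning every packet of a flow to one path chosen by a hash, and spraying packets across all paths in turn. It is a simplified illustration, not the UEC's mechanism; the flow names, packet counts, round-robin policy, and helper functions are all hypothetical.

```python
import zlib


def flow_hash_path(flow_id, num_paths):
    """Pick one path per flow from a deterministic hash of its ID."""
    return zlib.crc32(flow_id.encode()) % num_paths


def per_flow_load(flows, num_paths):
    """Every packet of a flow follows the single path its hash selects."""
    load = [0] * num_paths
    for flow_id, packets in flows.items():
        load[flow_hash_path(flow_id, num_paths)] += packets
    return load


def sprayed_load(flows, num_paths):
    """Packets are spread round-robin over all paths, regardless of flow."""
    load = [0] * num_paths
    next_path = 0
    for packets in flows.values():
        for _ in range(packets):
            load[next_path] += 1
            next_path = (next_path + 1) % num_paths
    return load


if __name__ == "__main__":
    # Hypothetical traffic: one very large flow and one small flow.
    flows = {"gpu0->gpu7": 1000, "gpu1->gpu3": 10}
    print("per-flow hashing:", per_flow_load(flows, 4))
    print("packet spraying: ", sprayed_load(flows, 4))
```

In this toy case the large flow saturates a single path under per-flow hashing, while spraying evens out the load. The trade-off is that sprayed packets can arrive out of order, which is one reason flexible delivery order appears alongside packet spraying in the UEC's list of requirements.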


