Enfabrica looks to accelerate GPU communication

“The design of today’s supercomputers is not very fault tolerant, and they have to really go through a lot of effort to handle failures correctly,” Mukherjee said.

Enfabrica brings fault tolerance to networking design. Rather than relying on point-to-point links, its fabric provides multiple paths between any two endpoints, so load can be distributed across them. When a link fails, the system redistributes the load across the remaining links.
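The idea of spreading traffic across redundant paths and reflowing it around a failed link can be sketched as follows. This is an illustrative toy, not Enfabrica's implementation; the link names and rates are hypothetical.

```python
# Toy sketch of multipath load distribution with failover.
# All link names and the 400 Gb/s aggregate load are assumptions
# for illustration, not figures from Enfabrica.

def distribute(load_gbps, links):
    """Split an aggregate load evenly across the healthy links.

    links maps link name -> True (up) or False (failed).
    Returns a dict of per-link load in Gb/s.
    """
    healthy = [name for name, up in links.items() if up]
    per_link = load_gbps / len(healthy)
    return {name: per_link for name in healthy}

links = {"link0": True, "link1": True, "link2": True, "link3": True}
print(distribute(400, links))   # 100.0 Gb/s on each of 4 links

links["link2"] = False          # one link fails
print(distribute(400, links))   # same load reflowed over 3 links
```

The point is that the failure reduces per-link headroom rather than severing any endpoint pair, which is the fault-tolerance property described above.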

“If you look at data centers today, it’s built around this model that a two-socket system is your working set. If things fit in that two-socket server, life is great. The moment it’s outside [those boundaries], it’s not that efficient,” said Mukherjee.

“We finally concluded that the architecture itself needs to change, and the way you solve that problem needs to be addressed,” Mukherjee said. “We said it has to be a silicon company. It has to be something that builds around this idea of what the modern system needs to look like and enables that in a fast and complete way.”

ACF-S delivers multi-terabit switching and bridging between heterogeneous compute and memory resources on a single silicon die, without changing physical interfaces, protocols or software layers above device drivers. It reduces the number of devices, the I/O latency hops and the power consumed in today’s AI clusters by top-of-rack network switches, RDMA-over-Ethernet NICs, InfiniBand HCAs, PCIe/CXL switches and CPU-attached DRAM.

CXL memory bridging allows ACF-S to deliver headless memory scaling to any accelerator, giving a single GPU rack direct, low-latency, uncontended access to local CXL.mem DDR5 DRAM with more than 50 times the capacity of the High-Bandwidth Memory (HBM) native to the GPUs.
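To give a feel for the scale of that "more than 50 times" claim, here is some back-of-the-envelope arithmetic. The per-GPU HBM size and rack configuration below are assumptions chosen for illustration, not figures from Enfabrica.

```python
# Illustrative arithmetic only. The 80 GB HBM per GPU and 8 GPUs per
# rack are assumed values, not vendor specifications.
hbm_per_gpu_gb = 80
gpus_per_rack = 8

rack_hbm_gb = hbm_per_gpu_gb * gpus_per_rack   # 640 GB of HBM in the rack

# "More than 50x" that capacity in CXL.mem-attached DDR5:
cxl_pool_gb = 50 * rack_hbm_gb
print(cxl_pool_gb)                             # 32000 GB, i.e. 32 TB
```

Under those assumptions, the headless DDR5 pool would be on the order of tens of terabytes per rack, versus hundreds of gigabytes of on-package HBM.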


