New Oak Ridge supercomputer outperforms the old in a fraction of the space
The conventional wisdom is that you should update your IT gear, namely the servers, every three-to-five years, which is usually when service warranties run out. However, some companies hold onto their gear for longer than that for a variety of reasons: lack of funds, business uncertainty, on-premises versus the cloud, and so forth.
And for a while, the CPU guys were helping. New generations of processors were only incrementally faster than the old ones, making it hard to justify an upgrade. The result was longer lifecycles for server hardware. A 2020 survey by IDC found 20.3% of respondents holding on to servers for six years and 12.4% keeping servers for seven years or more.
The Oak Ridge Leadership Computing Facility (OLCF), part of Oak Ridge National Laboratory (ORNL), is among the latter group. In 2019, it decommissioned Titan, a supercomputer that had been deployed in 2012 but was now beyond obsolete with its antiquated CPUs.
In its place is Crusher, which takes up 1/100 the space of Titan but still has better performance, and undoubtedly better power use, although that has not been officially announced.
Crusher is a Mini Me version of Frontier, an exascale supercomputer due to be deployed this year or next. Both computers are based on the same hardware, but Crusher is much smaller and serves as a testbed for applications that will eventually run on Frontier.
The hardware is HPE Cray EX blades with one 64-core AMD EPYC “Trento” CPU and four AMD MI250X GPUs. AMD hasn’t spoken about Trento very much, but it is a derivative of the Milan generation of EPYC processors. The MI250X is from AMD’s new family of enterprise GPUs, announced late last year and designed to compete against Nvidia’s offerings. On paper their numbers are extremely impressive.
Crusher features 192 HPE Cray EX blades in 1.5 rack cabinets that take up just 44 square feet, according to ORNL. Titan, by contrast, had 200 cabinets that hogged 4,352 square feet, or roughly 100 times the space of Crusher.
Titan maxed out peak performance at 27 petaflops. While we don’t have benchmarks yet for Crusher, we do know that the MI250X has peak double precision floating point performance of 47.9 teraflops. Multiply that by 768 (four per blade times 192 blades) and that comes out to 36.8 petaflops, well beyond Titan’s peak performance. And that’s not even including the CPUs.
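For readers who want to check the back-of-the-envelope math, here is a minimal Python sketch using only the figures quoted in this article. The constant and function names are made up for illustration, and the result is a theoretical GPU-only peak, not a benchmark.

```python
# All figures are the ones quoted in the article; names are illustrative only.
MI250X_PEAK_FP64_TFLOPS = 47.9   # peak double-precision teraflops per GPU
GPUS_PER_BLADE = 4
BLADES = 192
TITAN_PEAK_PFLOPS = 27.0         # Titan's quoted peak, in petaflops
CRUSHER_SQFT = 44
TITAN_SQFT = 4352


def crusher_gpu_count() -> int:
    """Total number of MI250X GPUs across all Crusher blades."""
    return GPUS_PER_BLADE * BLADES


def crusher_gpu_peak_pflops() -> float:
    """Theoretical GPU-only peak in petaflops (1 petaflop = 1,000 teraflops)."""
    return crusher_gpu_count() * MI250X_PEAK_FP64_TFLOPS / 1000


def footprint_ratio() -> float:
    """How many times more floor space Titan used than Crusher."""
    return TITAN_SQFT / CRUSHER_SQFT


if __name__ == "__main__":
    print(f"GPUs: {crusher_gpu_count()}")
    print(f"Crusher GPU peak: {crusher_gpu_peak_pflops():.1f} petaflops")
    print(f"Titan peak: {TITAN_PEAK_PFLOPS:.1f} petaflops")
    print(f"Floor space ratio (Titan / Crusher): {footprint_ratio():.1f}x")
```

The CPUs are left out on purpose, matching the estimate above, so the real theoretical peak would be slightly higher.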
We don’t have a measure of the power draw, so we can’t make a comparison between Crusher and Titan, but we do know that Crusher is water-cooled, and water-cooled systems tend to run at lower power. Titan consumed 8.2 MW of power. No doubt Crusher consumes a lot less. But then again, Crusher was meant to be a smaller system. It’s Frontier that will be the real beast, on a scale comparable to Titan.
Crusher is currently validating important scientific projects that will eventually be run on Frontier. Since they both have the exact same hardware, projects validated on Crusher should run just fine on Frontier. Of course, nothing is guaranteed.
So let this be a lesson. Hardware continues to advance, and even though generation to generation the jump may not be very big, after several years the gap becomes significant and worth the upgrade.
Copyright © 2022 IDG Communications, Inc.